PREFACE


These notes were written for an introductory course in probability
and statistics at the post-calculus level that was presented during the
fall term of 1974 to students in the Rand Graduate Institute. Most of
the material is devoted to the basic concepts of probability theory that
are prerequisite to learning mathematical statistics: probability
models, random variables, expectation and variance, joint distributions,
conditioning, correlation, and sampling theory. Among the distributions
treated are the binomial, hypergeometric, Poisson, negative binomial,
normal, gamma, lognormal, chi-square, and bivariate normal.

The last section of the notes provides an introduction to some of the
basic notions of parameter estimation: bias, efficiency, sufficiency,
completeness, consistency, maximum likelihood, and least-squares
estimation. Proofs of the Rao-Blackwell, Lehmann-Scheffé, and
Gauss-Markov Theorems are included.

The author wishes to thank the following RGI students for their
constructive comments on an earlier version of these notes, their
assistance in eliminating many (but surely not all) of the errors, and
their patience and goodwill: Joe Bolten, Tom Carhart, Chris Conover,
Wendy Cooper, Roger DeBard, Steve Glaseman, Masaaki Komal, Ragnhild
Mowill, Captain Michael A. Parmentier, and Hadi Soesastro.
SECTION I - INTRODUCTION

Statistics is the branch of applied mathematics that is concerned
with techniques for (1) collecting, describing, and interpreting data;
and (2) making decisions and drawing inferences based upon experimental
evidence. The term "statistics" is also used to refer to the data
themselves or numbers calculated from the data, as in the expression
"lies, damned lies, and statistics." Sometimes it is not clear which
usage is intended, as in the old saw, "You can prove anything with
statistics." At any rate, statistical terminology, measures, and
analytical techniques have become commonplace in the scientific
community for describing and interpreting experimental results, and a
knowledge of statistics has become a prerequisite for scientific
research in many fields.

As a branch of applied mathematics, statistics relies heavily on
mathematical models. The solution to a statistics problem typically
involves four steps:

(1) Statement of the real problem.
(2) Specification of a mathematical model to fit the problem.
(3) Solution of the mathematical problem.
(4) Application to the real problem.

Even if the real problem is completely specified in a particular
application, the choice of the mathematical model, and therefore the
solution, may still be practically unlimited. Obviously, the
mathematical model should contain the essential features of the physical
situation, but in most cases this will not lead to a unique
specification of the model, and it will be meaningless to refer to a
"correct" choice. The final choice of the model will be affected by the
intuition and subject matter knowledge of the model builder and perhaps
by his ability to carry out the mathematical solution. For now, let
us assume the choice has been made.

The next step, solving the mathematical problem, will often be
straightforward, since the model will probably be chosen using ease
of solution as a criterion. The final step, identifying the solution
of the mathematical model with the answer to the real world
problem, would appear immediate, but this is often the step where
the experimenter discovers that his presumably well-conceived
mathematical model yields a solution that cannot possibly satisfy the
real problem.

Since the mathematical models for statistical applications are
primarily probability models and since statistical theory depends
heavily on probability theory, we shall begin our study of statistics
with a consideration of those probability concepts that will be
needed in the sequel. But, before we proceed along that path, it
may be helpful to provide a single example of a statistical problem
to introduce some terminology and to indicate the applicability of the
models that will be treated.

Consider the problem of estimating the proportion of some population
who share a common attribute based upon a sample of a certain
size from that population. For example, the population might consist
of the voters in a certain state, and the problem might be to estimate
the proportion of the voters who favor a given candidate based
upon the stated preferences of a relatively small number of voters.
As a second example, consider estimating the proportion of defective
transistors produced by a given machine based upon a sample of
transistors chosen from that machine's output. Here, the population of
interest is not a group of people, but the set of transistors
produced by the machine.

As these examples illustrate, the problem under consideration
is a common one. So as not to confuse the issues involved, let us
pretend that the population of interest is a big can of marbles that
contains an unknown proportion p of red ones and that the sample
will consist of drawing 10 marbles one by one "with replacement" from
the can. A sample is said to be drawn with (or without) replacement
if, after each draw, the marble is (or is not) returned to the can.
In either case, the sample is said to be a random sample if on each
draw every marble in the can has the same chance of being selected.
Your problem: estimate (guess) the value of p based upon a random
sample of size 10 taken with replacement.

As a first step toward specifying a mathematical model to fit
this situation, note that the data of the experiment are conveniently
represented by a vector $x = (x_1, x_2, \ldots, x_{10})$ where $x_i$ is
1 or 0 according as the i-th marble drawn is red or not. Thus, if the
first two marbles drawn are red and the others are all white, then
$x = (1,1,0,0,0,0,0,0,0,0)$. This is an example of a sample point,
i.e., a point that summarizes the data for a particular realization
of an experiment. The set of all possible sample points x is called
the sample space for the experiment. Your estimate $\hat{p}$ can be
taken as any value computed from the vector x. Three possibilities that
you might consider are $\hat{p}_1 = \bar{x} = \sum_i x_i / 10$, or
perhaps $\hat{p}_2 = (1 + 8\bar{x})/10$, or even $\hat{p}_3 = 1/2$,
which ignores the data and guesses that p is 1/2 no matter what the
data indicate.

Note that the values of $\hat{p}_1$, $\hat{p}_2$, and $\hat{p}_3$ are
prescribed by the formulas above for all sample points x. These are
examples of statistics, i.e., numbers calculated from the data points.
These particular statistics are also called estimators of the parameter
p to differentiate them from other statistics in this example, such as
$\sum x_i$, $x_1 - x_2$, $\max(x_1, x_2)$, and 52. The values of the
estimators at a particular sample point are called estimates. Thus,
for the sample point (1,1,0,0,0,0,0,0,0,0), the three estimates of p
are $\hat{p}_1 = 1/5$, $\hat{p}_2 = 0.26$, and $\hat{p}_3 = 1/2$. Of
course, if the actual proportion of red marbles in the can is p = 1/2,
then $\hat{p}_3$ provides the best estimate of p. However, our
intuition tells us that for values of p near 0 or 1 the estimators
$\hat{p}_1$ and $\hat{p}_2$ will usually provide more reliable
estimates.
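The three estimates quoted above can be checked with a few lines of Python. This is only an illustrative sketch and not part of the original notes; the function name `estimates` is hypothetical, and exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction


def estimates(x):
    """Return the three estimates (p1, p2, p3) for a 0/1 sample point x."""
    xbar = Fraction(sum(x), len(x))   # sample mean of the 0/1 data
    p1 = xbar                         # the "usual" estimator
    p2 = (1 + 8 * xbar) / 10          # shrinks the sample mean toward 1/2
    p3 = Fraction(1, 2)               # ignores the data entirely
    return p1, p2, p3


sample_point = (1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
p1, p2, p3 = estimates(sample_point)
print(p1, p2, p3)  # 1/5 13/50 1/2
```

Here 13/50 is the value 0.26 given in the text.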
As this example indicates, estimates themselves have little
intrinsic interest, because one can always specify an estimator that
will yield any value whatsoever. In some applications of this model,
measures of goodness can be prescribed for comparing estimators, in
which case the problem of choosing an estimator reduces to solving
the mathematical problem of determining the one that is best in the
sense of these criteria. However, such instances are rare. In most
applications, clear-cut goodness criteria for estimators do not exist,
and one is content to report the value of the "usual" estimator of p,
namely, $\hat{p}_1 = \bar{x}$. As will be seen later, this estimator
has many desirable properties and contains all the information about p
that is provided by the sample.

A further discussion of this problem is deferred until the
elementary probability concepts required for this and other
statistical problems are treated. For a nontechnical discussion of
the nature of statistics, its uses and misuses, see W. Allen Wallis
and Harry V. Roberts, Statistics: A New Approach, Free Press, Glencoe,
Illinois, 1956, Chapters 1-3. For a pleasant diversion that is
somewhat related to the subject, see Darrell Huff, How to Lie with
Statistics, W. W. Norton and Co., New York, 1954.
Paul E. Pfeiffer, Concepts of Probability Theory, McGraw-Hill,
New York, 1965, pp. 1-40.

Certain physical experiments have the property that their outcomes
are somewhat unpredictable and appear to "depend on chance." As examples,
consider flipping a coin, throwing dice, picking three students by lot
from a class, spinning a roulette wheel, finding the lifetime of a light
bulb, and determining the time between successive telephone calls coming
into an exchange. If we rule out the uninteresting cases for the moment
(e.g., two-headed coins or dice controlled electronically so that "7" must
appear), each of these experiments has the property that the outcome of
the experiment cannot be predicted with certainty. Yet, when the
experiment is repeated many times, a certain regularity may appear. For
example, if a slightly bent coin is tossed many times, the relative
frequency of heads, computed after each toss and based upon all the
outcomes up to that toss, may seem to fluctuate less and less around a
particular number, say 2/3. Similarly, the successive averages of the
times between incoming telephone calls during a certain part of the day
may appear to "tend" to a certain number.
These experiments also suggest questions about the "chance" or
"likelihood" or "probability" of certain outcomes or collections of
outcomes occurring: "If two dice are tossed, what is the probability
of getting a total of seven or more?" "If telephone calls come into an
exchange at an average rate of 4 per minute, what is the probability of
getting more than 10 in any one minute during the next hour?" "What is
the probability of drawing a straight flush in poker?"
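The regularity described above for the bent coin can be explored numerically. The following is a minimal simulation sketch and not part of the original notes: the function name, the fixed seed, and the use of Python's pseudo-random generator as a stand-in for a physical coin with a 2/3 chance of heads are all assumptions made for illustration.

```python
import random


def running_relative_frequency(p_heads, n_tosses, seed=0):
    """Relative frequency of heads after each of n_tosses simulated tosses."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    heads = 0
    freqs = []
    for k in range(1, n_tosses + 1):
        if rng.random() < p_heads:     # simulated toss lands heads
            heads += 1
        freqs.append(heads / k)        # frequency based on tosses 1..k
    return freqs


freqs = running_relative_frequency(2 / 3, 10_000)
for k in (10, 100, 1_000, 10_000):
    print(k, round(freqs[k - 1], 3))
```

In a typical run the printed frequencies wander noticeably for small k and settle near 2/3 for large k, which is the behavior the text describes.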

Before tackling a formal definition of probability, we shall first
put the idea of a random experiment into a mathematical framework. To
the outcomes of interest of the experiment, we associate the elements
of a set S called a sample space. That is, a sample space of an
experiment is a set S such that each element of S corresponds to one of
the outcomes of the experiment. For example, if the experiment consists
of tossing a coin, we might take as our sample space the set
$S = \{H, T, E\}$, i.e., S is the set consisting of the three letters H,
T, and E, where "H" stands for "heads," "T" for "tails," and "E" for
"edge." As this example shows, the choice of S is somewhat arbitrary.

As a second example consider the experiment of throwing two dice.
For convenience let us assume that the dice are painted red and green
to distinguish them. Then we can designate the outcome that 3 turns
up on the red die and 4 turns up on the green die by the pair (3,4).
Using similar designations for the other possible outcomes, we see that an
appropriate sample space S for this experiment is the set of pairs (x,y)
where x and y are integers from 1 to 6. This sample space can be
visualized by plotting the pairs as indicated in the figure below. We can
write S in set notation by listing all the elements of S as follows:

    $S = \{(1,1), (1,2), (1,3), \ldots, (6,6)\}$.

Alternatively, we can write

    $S = \{(x,y) : x \text{ and } y \text{ are integers from 1 to 6}\}$,

which can be read as "S is the set of pairs (x,y) such that x and y
are integers from 1 to 6."
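As a small illustration (not part of the original notes), the two-dice sample space can be listed mechanically. The sketch below assumes the red die is the first coordinate and the green die the second, as in the text.

```python
from itertools import product

# All ordered pairs (x, y) with x and y integers from 1 to 6.
S = set(product(range(1, 7), repeat=2))

print(len(S))        # 36
print((3, 4) in S)   # True: 3 on the red die, 4 on the green die
```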

The elements of a sample space S are sometimes called sample points
(or just points), and an event is a collection of sample points, i.e.,
a subset of S. (For the moment, any subset of S will be referred to
as an event; later, for technical reasons, the term "event" will be
reserved for subsets of S in a certain class.) For example, the pair
(5,2) is a sample point in the sample space S above; this can be
written as $(5,2) \in S$, where the symbol "$\in$" stands for "is an
element of" or "belongs to." In the game of craps it is of interest
to consider the event A corresponding to a total of seven on the two
dice. (See the figure.) In set notation this event could be written as:

    $A = \{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}$
    or $A = \{(x,y) : x + y = 7\}$.

The event B designated in the figure corresponds to the result that
the red die turns up 1:

    $B = \{(x,y) : x = 1\}$.

In general, an event A is said to occur if the outcome of the
experiment corresponds to a sample point s in S such that $s \in A$.
Thus, if A and B are the events defined above and if the result of
tossing the dice is 5 on the red die and 2 on the green, then A
occurs but B does not occur. If A and B are events such that A
is a subset of B, written $A \subset B$ or $B \supset A$, then clearly
whenever A occurs, B must also occur.

It will be convenient to have notation for the union and intersection
of any two events A and B. As the words suggest, the union of A and
B, denoted by $A \cup B$, is the set of all those points that belong to
at least one of the sets A and B, whereas the intersection of A and B,
denoted by $A \cap B$, consists of those points which belong to both A
and B. Thus, in the example above,

    $A \cup B = \{(x,y) : x = 1 \text{ or } x + y = 7\}$

    $A \cap B = \{(x,y) : x = 1 \text{ and } x + y = 7\} = \{(1,6)\}$.

Note that the event $A \cup B$ occurs if either A or B occurs (or both),
whereas $A \cap B$ occurs if and only if both A and B occur. Also note
that the notions of union and intersection can be extended to more than
two events. For example, if A, B, and C are events, then
$A \cap B \cap C$ is the set of points common to all three sets. Also,
if $A_1, A_2, \ldots$ is a sequence of events, then
$\bigcup_{i=1}^{\infty} A_i$ (or $A_1 \cup A_2 \cup \cdots$) is the set
of all points that belong to at least one of the sets $A_i$, and
$\bigcap_{i=1}^{\infty} A_i$ is the set of points that belong to all
the sets $A_i$.

If two events A and B have no points in common, we say that the
events are disjoint (or mutually exclusive). Introducing the symbol
$\emptyset$ to denote the "empty set" (i.e., the set having no
elements), we can write this as $A \cap B = \emptyset$. For example, in
the dice throwing sample space above, if

    $A = \{(x,y) : x + y = 7\}$ and
    $B = \{(1,1), (1,2), (2,1), (6,6)\}$, then $A \cap B = \emptyset$.

The complement of an event A, denoted by $A^c$, is the event
consisting of those points in S that do not belong to A. Symbolically,
$A^c = \{s : s \notin A\}$; here, "$\notin$" stands for "does not belong
to." Note that $A \cap A^c = \emptyset$ and $A \cup A^c = S$.
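The set operations introduced above map directly onto Python's built-in set type. The sketch below is an illustration only, not part of the original notes. The name `B2` is a hypothetical label for the second set B used in the disjointness example, introduced here to keep it apart from the event "red die shows 1."

```python
from itertools import product

S = set(product(range(1, 7), repeat=2))          # two-dice sample space

A = {(x, y) for (x, y) in S if x + y == 7}        # total of seven
B = {(x, y) for (x, y) in S if x == 1}            # red die shows 1
B2 = {(1, 1), (1, 2), (2, 1), (6, 6)}             # set from the disjointness example

union = A | B                                     # points in A or B (or both)
intersection = A & B                              # points in both A and B
A_complement = S - A                              # points of S not in A

print(len(union))                 # 11
print(intersection)               # {(1, 6)}
print(A & B2 == set())            # True: A and B2 are disjoint
print(A | A_complement == S)      # True
print(A & A_complement == set())  # True
```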

Example. Let S be the Cartesian plane, i.e.,
$S = \{(x,y) : x \text{ and } y \text{ are real numbers}\}$. Then the
"curve" $y = x^2$ is the set $A = \{(x,y) : y = x^2\}$. The set
$B = \{(x,y) : x^2 + y^2 < 1\}$ is the set of points inside the circle
of radius 1 centered at the origin. If
$C = \{(x,y) : x^2 + y^2 = -1\}$, then $C = \emptyset$. To "solve" the
set of equations x + y = 5 and 3x - y = 3 means to find the
intersection of the sets $D = \{(x,y) : x + y = 5\}$ and
$E = \{(x,y) : 3x - y = 3\}$, namely, $D \cap E = \{(2,3)\}$. The set
$F = \{(x,y) : 3x - y < 3\}$ is the set of points above the line
y = 3x - 3; $F^c$ is the set of points on or below this line. Note that
$F \cap B \neq \emptyset$ (for example, the origin belongs to both
sets).

We shall want to talk about the probability of any event A, denoted
by P(A). As this notation suggests, P will be defined as a function of
events. To begin with, let us assume that the sample space is finite,
say $S = \{s_1, s_2, \ldots, s_n\}$. A finite probability model is
prescribed by assigning numbers $p_i$ to the sample points $s_i$ such
that
(a) each $p_i$ is nonnegative, and
(b) $\sum_{i=1}^{n} p_i = 1$.
In this case, the probability P(A) of any event A is the sum of the
$p_i$ assigned to the points that belong to A.

For example, consider the coin-tossing example where the sample space
chosen was $S = \{H, T, E\}$. In this case, there are only 8 events,
namely,

    $\emptyset, \{H\}, \{T\}, \{E\}, \{H,T\}, \{H,E\}, \{T,E\}, S$.

If the coin is fairly fat and bent a little, an appropriate assignment
of probabilities $p_i$ to the points H, T, and E might be 1/2, 1/3, and
1/6, in which case the probabilities of the events are

    $P(\emptyset) = 0$        $P(\{H,T\}) = 5/6$
    $P(\{H\}) = 1/2$          $P(\{H,E\}) = 2/3$
    $P(\{T\}) = 1/3$          $P(\{T,E\}) = 1/2$
    $P(\{E\}) = 1/6$          $P(S) = 1$
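The table above can be reproduced by summing point probabilities over each subset of S. The following sketch is illustrative only and not part of the original notes; the names `p`, `prob`, and `events` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Point probabilities for the fat, slightly bent coin.
p = {"H": Fraction(1, 2), "T": Fraction(1, 3), "E": Fraction(1, 6)}


def prob(event):
    """P(A): sum of the p_i assigned to the points that belong to A."""
    return sum((p[s] for s in event), Fraction(0))


points = sorted(p)
# Every subset of S is an event; a 3-point space has 2**3 = 8 of them.
events = [frozenset(c)
          for r in range(len(points) + 1)
          for c in combinations(points, r)]

for event in events:
    print(sorted(event), prob(event))
```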

Although we might want to choose another P to fit a particular coin,
this choice of P is at least consistent with some of our intuitive
notions about probability, namely:

I. $0 \leq P(A) \leq 1$ for all events A.

II. $P(\emptyset) = 0$, $P(S) = 1$.

IIIa. If A and B are events such that $A \cap B = \emptyset$, then
$P(A \cup B) = P(A) + P(B)$.

Similarly, if S is countably infinite, say $S = \{s_1, s_2, \ldots\}$,
one can assign probabilities to all subsets of S in a consistent way by
first assigning probabilities $p_i$ to the points $s_i$ where
$p_i \geq 0$ and $\sum_i p_i = 1$. Then, for any event A, P(A) is
defined by

    $P(A) = \sum_{s_i \in A} p_i$.


It is easily checked that P satisfies conditions I, II, and IIIa above
as well as:

III. If $A_1, A_2, \ldots$ are events such that
$A_i \cap A_j = \emptyset$ whenever $i \neq j$, then

    $P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.

In general, a set function P on the class of events of a sample
space S, countable or not, is said to be a probability measure if P
satisfies conditions I-III above. (Condition IIIa follows from III by
setting $A_1 = A$, $A_2 = B$, and $A_3 = A_4 = \cdots = \emptyset$ in
III.) Its value P(A) for any event A is then called the probability of
A. To sum up the discussion above, if the sample space S is countable
(in which case it is said to be discrete), P can be prescribed by
assigning nonnegative values $p_i$ that sum to unity to the individual
sample points $s_i$, in which case the probability of any event is the
sum of the probabilities assigned to the points that belong to that
event.

Using the properties I-III above, one can easily show that for any
probability measure P and any events A and B,

(1) $P(A^c) = 1 - P(A)$

(2) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

(3) $P(A \cup B) \leq P(A) + P(B)$

(4) If $B \subset A$, $P(B) \leq P(A)$.
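For readers who want the steps spelled out, one possible derivation of (1)-(4) from properties I-III is sketched below. It is a standard argument supplied here for convenience and is not taken from the original notes; each line splits a set into disjoint pieces and applies IIIa.

```latex
\begin{align*}
\text{(1)}\quad & S = A \cup A^c \text{ with } A \cap A^c = \emptyset,
   \text{ so } 1 = P(S) = P(A) + P(A^c). \\
\text{(2)}\quad & A \cup B = A \cup (B \cap A^c) \text{ and }
   B = (A \cap B) \cup (B \cap A^c), \text{ both disjoint unions, so} \\
 & P(A \cup B) = P(A) + P(B \cap A^c)
   = P(A) + P(B) - P(A \cap B). \\
\text{(3)}\quad & P(A \cap B) \ge 0 \text{ by I, so (2) gives }
   P(A \cup B) \le P(A) + P(B). \\
\text{(4)}\quad & B \subset A \text{ implies } A = B \cup (A \cap B^c)
   \text{ (disjoint), so } P(A) = P(B) + P(A \cap B^c) \ge P(B).
\end{align*}
```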

For the present we shall assume that P is given or that there is
a "natural" choice of P suggested by the problem. Whether the
probabilities P(A) actually fit the physical situation in some sense or
how they are measured in practice does not enter the picture at this
stage. This is analogous to the situation in trigonometry when one is
given the lengths of the sides of a triangle and is asked to determine
the area.

The case where the physical experiment, when properly viewed, has
N outcomes which appear to be "equally likely" can be handled immediately
in this framework, at least theoretically. The key words in such problems
are "chosen at random," "fair coin," "honest dice," "selected by lot,"
etc. For such situations, one can choose an appropriate sample space S
with N points and assign probability 1/N to each point. Then, for
any event A, P(A) = (number of elements in A)/N.


Example. (Dice throwing) If two dice are thrown, find the probability
of getting a total of (a) seven, (b) four or ten.

Solution. The problem remains unchanged if we consider the dice
distinguishable, say red and green. Let
$S = \{(x,y) : x, y \text{ are integers from 1 to 6}\}$.
For example, the sample point (3,4) corresponds to 3 on the red die and
4 on the green. Assuming equally likely outcomes (honest dice), we
assign probability 1/36 to each point.

[Figure: the 36 sample points of S plotted on a grid, with both axes
running from 1 to 6.]

(a) The event A, "seven occurs," contains 6 points, so
P(A) = 6/36 = 1/6.

(b) The event B, "four or ten occurs," also contains 6 points, so
P(B) = 1/6.
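Both answers can be confirmed by counting sample points. The sketch below is an illustration under the equally-likely assumption stated in the solution; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))   # 36 equally likely points


def prob(event):
    """P(A) = (number of points of S satisfying the predicate) / N."""
    return Fraction(sum(1 for s in S if event(s)), len(S))


p_seven = prob(lambda s: s[0] + s[1] == 7)
p_four_or_ten = prob(lambda s: s[0] + s[1] in (4, 10))
print(p_seven, p_four_or_ten)  # 1/6 1/6
```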
Example. (Coin-tossing) If a fair coin is tossed four times (or
if four fair coins are tossed), what is the probability of getting at
least two heads?

Solution. An appropriate sample space for a single toss is
$S_1 = \{H, T\}$. For four repetitions of the experiment, we can let
$S = S_1 \times S_1 \times S_1 \times S_1 =
\{(x_1, x_2, x_3, x_4) : x_i \in S_1\}$.¹
The sample point (T,T,T,H), for example, corresponds to obtaining tails
on the first three tosses and heads on the fourth. There are
$2^4 = 16$ points in S, and we assign probability 1/16 to each point.
The complement $A^c$ of the event A, "at least two heads," contains
five points: (T,T,T,T), (H,T,T,T), (T,H,T,T), (T,T,H,T), (T,T,T,H).
Therefore,

    $P(A) = 1 - P(A^c) = 1 - (5/16) = 11/16$.
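The complement argument can be checked by listing the 16 sample points. This is an illustrative sketch, not part of the original notes; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

S4 = list(product("HT", repeat=4))             # the 2**4 = 16 sample points

A = [s for s in S4 if s.count("H") >= 2]        # at least two heads
A_c = [s for s in S4 if s.count("H") < 2]       # complement: zero or one head

p_A = 1 - Fraction(len(A_c), len(S4))           # P(A) = 1 - P(A^c)
print(len(S4), len(A_c), p_A)                   # 16 5 11/16
```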

Exercise. An absent-minded hatcheck girl has 4 hats belonging to
4 men. Since she cannot remember which hat belongs to each man, she
returns them at random. Find the probability that

(a) exactly two men get their own hats back. Ans. 1/4.

(b) at least two men get their own hats back. Ans. 7/24.

(Set up an appropriate sample space and show the correspondence between
the sample points and the outcomes of the experiment.)
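One way to check the stated answers, after working the exercise by hand, is to take the 24 orderings of the hats as equally likely sample points. The sketch below assumes the men are numbered 0 to 3 and that position i of a tuple holds the hat returned to man i; these conventions and the function name `matches` are illustrative choices, not part of the original notes.

```python
from fractions import Fraction
from itertools import permutations

S = list(permutations(range(4)))    # 4! = 24 ways to hand back the hats


def matches(s):
    """Number of men who receive their own hat under assignment s."""
    return sum(1 for man, hat in enumerate(s) if man == hat)


p_exactly_two = Fraction(sum(1 for s in S if matches(s) == 2), len(S))
p_at_least_two = Fraction(sum(1 for s in S if matches(s) >= 2), len(S))
print(p_exactly_two, p_at_least_two)  # 1/4 7/24
```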

As another example of an experiment that fits the equally likely
outcomes case, consider the experiment of choosing a sample of size r at
random without replacement from some population of n objects, say
$\Omega = \{a_1, a_2, \ldots, a_n\}$ where $n \geq r$. For purposes of
illustration, let r = 3, and suppose that the experiment is conducted
by first choosing one of the elements in $\Omega$ in such a way that
each element has the same chance of being chosen. Then a second element
is chosen at random from those remaining. Finally, a third element is
chosen at random from those remaining after the first and second have
been chosen. If the elements $a_i$, $a_j$, $a_k$ are chosen in that
order, this outcome can be represented by the 3-tuple
$(a_i, a_j, a_k)$. Similarly, the result of choosing a sample of size r
can be represented by an r-tuple $(x_1, x_2, \ldots, x_r)$ where the
components $x_i$ are different elements of the population. This r-tuple
is an example of a permutation, i.e., an arrangement of r symbols from
a set of size n in which repetitions are not allowed.

¹This notation uses an obvious generalization of the notation for the
Cartesian product $C \times D$ of two sets C and D as defined by:

    $C \times D = \{(c,d) : c \in C, d \in D\}$.

Thus, $C \times D$ is the set of all ordered pairs having the property
that the first component belongs to C and the second component belongs
to D.

The number of permutations of n symbols taken r at a time, denoted
by P(n,r), can be determined as follows. The first component of
the r-tuple can be filled by any of the n symbols, the second by any
of the n-1 symbols not already used in filling the first component, ...,
the r-th by any of the n-(r-1) symbols not already used in filling the
first r-1 components. The total number of different ways of filling
the r components is

    $P(n,r) = n(n-1)(n-2) \cdots (n-r+1) = n!/(n-r)!$  for r = 1, 2, ..., n

where $n! = n(n-1)(n-2) \cdots (3)(2)(1)$ and $0! = 1$. The reason for
setting $0! = 1$ is to have the formula $P(n,r) = n!/(n-r)!$ hold for
r = n, in which case $P(n,r) = n!$.

If the elements in the sample are drawn simultaneously so that the order
in which the elements are drawn is unknown, the outcomes of the experiment
can be represented using combinations (subsets) of size r instead of
r-tuples. For example, if r = 3, the subset $\{a_i, a_j, a_k\}$
corresponds to drawing the elements $a_i$, $a_j$, and $a_k$ in some
order. Note that for each subset of size three, say
$\{a_i, a_j, a_k\}$, there are 3! = 6 permutations, namely,

    $(a_i,a_j,a_k)$, $(a_i,a_k,a_j)$, $(a_j,a_i,a_k)$,
    $(a_j,a_k,a_i)$, $(a_k,a_i,a_j)$, $(a_k,a_j,a_i)$.

Hence, the number of subsets of size three is the number of permutations
of size three divided by 3!. In general, if $\binom{n}{r}$ denotes the
number of different subsets of size r from a set of size n, then it
follows by an argument similar to that above for the case r = 3 that

    $\binom{n}{r} = \frac{P(n,r)}{r!} = \frac{n!}{r!\,(n-r)!}$  for r = 0, 1, ..., n.


Theorem 2-1. Given any set of size n, say
$\Omega = \{a_1, \ldots, a_n\}$, the number of ordered r-tuples
(permutations) $(x_1, \ldots, x_r)$ such that the $x_i$'s are different
elements of $\Omega$ is

    $P(n,r) = n(n-1) \cdots (n-r+1) = n!/(n-r)!$  for r = 1, 2, ..., n.

The number of subsets (combinations) of size r from $\Omega$ is

    $\binom{n}{r} = n!/(r!\,(n-r)!)$  for r = 0, 1, ..., n.
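The counting formulas of Theorem 2-1 can be compared with a brute-force enumeration. The sketch below uses a hypothetical population of n = 5 labeled symbols and r = 3, which are values chosen only for illustration; `math.perm` and `math.comb` require Python 3.8 or later.

```python
import math
from itertools import combinations, permutations

n, r = 5, 3
omega = [f"a{i}" for i in range(1, n + 1)]          # a1, ..., a5

n_perm = len(list(permutations(omega, r)))          # ordered r-tuples, no repeats
n_comb = len(list(combinations(omega, r)))          # subsets of size r

print(n_perm, math.perm(n, r))   # 60 60
print(n_comb, math.comb(n, r))   # 10 10

# P(n, r) = n!/(n-r)!  and  C(n, r) = P(n, r)/r!
assert n_perm == math.factorial(n) // math.factorial(n - r)
assert n_comb == n_perm // math.factorial(r)
```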
The following example illustrates how the above results are used
in sampling inspection.

Example. A box contains 12 items of which 9 are defective. What
is the probability that a random sample of size 4 taken without
replacement will contain exactly 3 defectives?

Let the set of 12 items be denoted by
$\Omega = \{D_1, \ldots, D_9, G_1, G_2, G_3\}$.
Two solutions will be given below, the first using subsets of $\Omega$
of size 4 as sample points and the second using permutations of size 4
as sample points. Although the sample spaces are quite different, the
solutions to the problem yield the same answer.

Solution A. Set
$S = \{x : x \text{ is a subset of size 4 from } \Omega\}$.
The number of points in S is

    $\#S = \binom{12}{4} = \frac{12!}{4!\,8!}
         = \frac{12 \cdot 11 \cdot 10 \cdot 9}{4 \cdot 3 \cdot 2 \cdot 1} = 495$.

Assign probability 1/495 to each point. Let
$A = \{x \in S : x \text{ contains 3 D's and 1 G}\}$. Since
#A = (no. of ways of choosing 3 of 9 D's) × (no. of ways
of choosing 1 of 3 G's) = $\binom{9}{3}\binom{3}{1} = 252$,

    $P(A) = \binom{9}{3}\binom{3}{1} / \binom{12}{4} = 252/495 = 28/55$.

Solution B. Set
$S = \{(x_1, x_2, x_3, x_4) : x_i \in \Omega,\ x_i \neq x_j \text{ if } i \neq j\}$.
Then $\#S = P(12,4) = 12 \cdot 11 \cdot 10 \cdot 9$. Let
$A = \{x \in S : \text{exactly three } x_i\text{'s are D's}\}$. Then

#A = (no. of ways of choosing 3 of 9 D's) × (no. of ways of choosing 1
of 3 G's) × (no. of ways of ordering the four chosen symbols)
   = $\binom{9}{3}\binom{3}{1}\,4!$,

so that
$P(A) = \binom{9}{3}\binom{3}{1}\,4!/P(12,4)
      = \binom{9}{3}\binom{3}{1}/\binom{12}{4} = 28/55$.

The above argument is easily generalized to prove the following
theorem:

Theorem 2-2. A random sample of size n is taken without replacement
from a lot of N items of which the proportion p are defective.
The probability p(x) that the sample will contain exactly x defectives
is

    $p(x) = \binom{Np}{x}\binom{Nq}{n-x} / \binom{N}{n}$  for x = 0, 1, 2, ..., n

where q = 1-p.
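The formula of Theorem 2-2 is easy to evaluate directly. The sketch below is illustrative and not part of the original notes: the function name and its parameters (`n_defective` for Np and `n_good` for Nq) are hypothetical, and the example call uses the box of 12 items from the preceding example.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, n, n_defective, n_good):
    """P(exactly x defectives in a sample of n drawn without replacement)."""
    total = n_defective + n_good
    # comb returns 0 when more items are requested than are available.
    return Fraction(comb(n_defective, x) * comb(n_good, n - x), comb(total, n))


# Box of 12 items, 9 defective and 3 good, sample of size 4.
print(hypergeom_pmf(3, 4, 9, 3))                          # 28/55
print(sum(hypergeom_pmf(x, 4, 9, 3) for x in range(5)))   # 1
```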
Now suppose that the sample is taken with replacement. Then an
appropriate sample space for the experiment is

    $S = \{(s_1, s_2, \ldots, s_n) : s_i \in \Omega\}$.

Since each component of the sample points can be filled in N ways and
repetitions are permitted, the number of points in S is $N^n$. Let A
be the event that exactly x of the items drawn are defective. The
number of points in A is the number of ways of choosing x of the n
components to be filled by D's [namely, $\binom{n}{x}$] multiplied by
the number of ways of filling the x chosen components with D's [namely,
$(Np)^x$] multiplied by the number of ways of filling the remaining n-x
components with G's [namely, $(Nq)^{n-x}$ where q = 1-p]. Therefore,
the number of points in A is

    $\#A = \binom{n}{x}(Np)^x(Nq)^{n-x}$,

and

    $P(A) = \#A/N^n = \binom{n}{x} p^x q^{n-x}$  for x = 0, 1, 2, ..., n.

This proves the following result:

Theorem 2-3. If a random sample of size n is taken with replacement
from a lot of N items of which the proportion p are defective,
then the probability p(x) that the sample will contain exactly x
defectives is

    $p(x) = \binom{n}{x} p^x q^{n-x}$  for x = 0, 1, 2, ..., n

where q = 1-p.
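The counting argument behind Theorem 2-3 can be checked against a direct enumeration of the $N^n$ sample points. The sketch below reuses the earlier lot of 12 items with 9 defectives, now sampled with replacement. That reuse, the function name, and the convention that items 0 to 8 are the defectives are assumptions made only for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(x, n, p):
    """P(exactly x defectives in n draws with replacement)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)


N, n, x = 12, 4, 3
n_defective = 9                          # items 0..8 are defective (assumed labeling)
p = Fraction(n_defective, N)

# Brute force: count n-tuples with exactly x defective components.
count_A = sum(1 for s in product(range(N), repeat=n)
              if sum(item < n_defective for item in s) == x)

print(count_A, N**n)                     # 8748 20736
print(binomial_pmf(x, n, p))             # 27/64
```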

As an example of an experiment that requires an infinite sample space,
imagine a person tossing a fair coin until a head occurs. The previous
example suggests using the sample space

    $\{(H), (T,H), (T,T,H), \ldots, (T,T,\ldots)\}$

where (T,T,...) corresponds to never obtaining heads. A slightly simpler
sample space $S = \{1, 2, 3, \ldots, \infty\}$ is obtained by
considering the so-called "waiting time" for heads, i.e., the number of
the trial on which heads first occurs. Since the coin is assumed fair,
we let $P\{1\} = 1/2$. Analogy with the previous example, where we set
$P\{(T,T,T,H)\} = 1/16 = 1/2^4$, prompts us to set $P\{4\} = 1/2^4$.
Similar considerations for any n lead us to set $P\{n\} = 1/2^n$ for
every n. Then, since

    $\sum_{n=1}^{\infty} P\{n\} = \sum_{n=1}^{\infty} 1/2^n = 1$,

we must have $P\{\infty\} = 0$, which is consistent with our intuitive
notion that, if the coin is really fair, it cannot come up tails
infinitely many times.

Having  assigned  probabilities  to  the  elementary  events,  we  can 
compute  the  probability  of  any  event.  For  example,  the  probability  that 
at  least  4 tosses  are  needed  is 

$$1 - P\{1,2,3\} = 1 - (1/2 + 1/4 + 1/8) = 1/8.$$

Also, the probability that the waiting time is odd is

$$P\{s : s \text{ is odd}\} = \sum_{k=1}^{\infty} 2^{-2k+1} = \frac{1/2}{1 - 1/4} = 2/3.$$
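The two computations above can be reproduced with exact arithmetic. The following is a minimal sketch; the function names are illustrative, and the infinite series is replaced by a partial sum whose distance from 2/3 is checked explicitly.

```python
from fractions import Fraction


def p_wait(n):
    """P{n} = 1/2**n: probability that heads first occurs on trial n."""
    return Fraction(1, 2 ** n)


def odd_partial_sum(terms):
    """Sum of P{n} over the first `terms` odd waiting times 1, 3, 5, ..."""
    return sum(p_wait(2 * k - 1) for k in range(1, terms + 1))


at_least_four = 1 - sum(p_wait(n) for n in (1, 2, 3))
print(at_least_four)                          # probability of needing 4 or more tosses
print(float(odd_partial_sum(20)))             # approaches 2/3
print(Fraction(2, 3) - odd_partial_sum(20))   # remainder of the geometric series
```

The remainder after $K$ terms is $(2/3)\,4^{-K}$, so the partial sums converge to 2/3 very quickly.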

Example. According to the U.S. Bureau of the Census (Current Population
Reports, Series P-60, No. 78, May 20, 1971), the "distribution" of
family income in 1970 in the United States was as follows:

    Family           Percent of       Family           Percent of
    Income           Families         Income           Families

    Under $1000         1.6           $7000-7999          6.3
    $1000-1999          3.0           $8000-9999         13.6
    $2000-2999          4.3           $10000-11999       12.7
    $3000-3999          5.0           $12000-14999       14.1
    $4000-4999          5.3           $15000-24999       17.7
    $5000-5999          5.8           $25000-49999        4.1
    $6000-6999          6.0           $50000 up           0.5

This distribution can be represented graphically using a histogram as
indicated in the figure below. Note that the heights of the rectangles
above the income intervals have been chosen in such a way that the areas
of the rectangles are proportional to the percentages given in the table.


Although the reason for doing so will not be apparent at this time,
one can build a probability model around the distribution above by
considering the experiment of choosing a family "at random" from the
population of all families and recording, as the outcome of the experiment,
the family income of the family selected. As a sample space for this
experiment, we can take the set of nonnegative real numbers: $S = [0,\infty)$.

Guided by the table above, we can choose our class of events to be the sets
[0, 1000), [1000, 2000), ..., and unions of these intervals. To be
consistent with the table above, we let our probability measure P have
values:

$$P([0,1000)) = .016, \quad P([1000,2000)) = .030, \quad \text{etc.}$$

If a family is chosen at random from the population, the event A
corresponding to selecting one having income less than $3000 is the event

$$A = [0,1000) \cup [1000,2000) \cup [2000,3000),$$

and the probability of this event is

$$P(A) = .016 + .030 + .043 = .089,$$

which is the proportion of families in the population having income less
than $3000 according to the Bureau of Census estimates.
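The model just described amounts to a lookup table on the income classes, with the probability of a union obtained by addition. A small sketch, assuming the percentages of the table above; the variable and function names are illustrative only, and the function is meant for limits that coincide with class boundaries.

```python
from fractions import Fraction

# (lower bound, upper bound, percent of families); None marks the open-ended class.
INCOME_CLASSES = [
    (0, 1000, "1.6"), (1000, 2000, "3.0"), (2000, 3000, "4.3"),
    (3000, 4000, "5.0"), (4000, 5000, "5.3"), (5000, 6000, "5.8"),
    (6000, 7000, "6.0"), (7000, 8000, "6.3"), (8000, 10000, "13.6"),
    (10000, 12000, "12.7"), (12000, 15000, "14.1"), (15000, 25000, "17.7"),
    (25000, 50000, "4.1"), (50000, None, "0.5"),
]


def prob_income_below(limit):
    """P([0, limit)) as a sum over the classes lying entirely below `limit`."""
    return sum(
        Fraction(percent) / 100
        for _, upper, percent in INCOME_CLASSES
        if upper is not None and upper <= limit
    )


total = sum(Fraction(percent) / 100 for _, _, percent in INCOME_CLASSES)
print(float(prob_income_below(3000)))  # the probability of the event A above
print(total)                           # the class probabilities add up to one
```

Events such as [0, 2500) are deliberately not handled: the table does not determine their probabilities, which is the point made in the next paragraph of the notes.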
Note that our class of events did not include every subset of S
in this case. Our class of events was restricted to those subsets of S
whose probabilities were determined either directly from the table or by
application of the axioms for a probability measure. The next example
indicates another reason for considering classes of events that do not
include all subsets of the sample space.

Example. (Spinning a spinner) Imagine trying to choose a real
number between 0 and 1 "at random." A hypothetical physical model
for this would be to spin a perfectly balanced spinner on a circle with
uniform markings from 0 to 1. Here, an obvious choice for the sample
space is $S = [0,1]$, which is uncountable. In order for the numbers to
be "equally likely," each singleton set must have probability zero in
this case, so that the scheme used to assign probabilities in the discrete
case breaks down. However, we clearly want to have, for example, $P[.3,.4] = .1$
and $P(.25,.39] = .14$, which leads us to assign to any interval
$(a,b)$ (or $(a,b]$ or $[a,b)$ or $[a,b]$) its "length" $b - a$ as its
probability. Following condition III for a probability measure, probability
can also be assigned to any set which is a countable union of disjoint
intervals, and this value again coincides with our notion of the "length"
of the set.

Is there a consistent way of defining "length" for every subset of
[0,1]? Unfortunately, the answer is no. (Reference: H. L. Royden, Real
Analysis, Macmillan, New York, 1963, p. 43.) One way out of this difficulty
is to restrict the class of events, i.e., the class of subsets of [0,1]
for which probability is assigned.

One such restriction is to the smallest class of subsets which
contains the intervals and is closed under countable unions, countable
intersections, and complementation. For our purposes it suffices to know that
such a class exists and that there is a way of defining a probability
measure on this class which corresponds to our intuitive notion of length.

Note that in this example the probability of any interval $[a,b]$ with
$0 \le a < b \le 1$ can be visualized as the area under the "curve" $f(x) = 1$
for $0 \le x \le 1$ and between the ordinates $x = a$ and $x = b$, as illustrated
in the figure below.

The next example shows how other curves can be used to prescribe
probability measures on the line.

Example. Consider the waiting time in minutes between telephone calls
coming into an exchange. A histogram based upon the observed waiting times
for 100 calls coming into the exchange during a certain period of the day
may look like the figure on the left below. The figure is intended to
depict a case where 42 out of 100 waiting times were less than one minute.
Let $S = (0,\infty)$. Theory to be developed later in this course
suggests that, if the average waiting time between calls is 2 minutes, then
a reasonably well-fitting model might be obtained by assigning
probabilities to intervals $[a,b]$ using areas under the curve
$f(x) = (1/2)e^{-x/2}$ as is illustrated in the figure on the right above.
That is,

$$P([a,b]) = \int_a^b (1/2)e^{-x/2}\,dx = e^{-a/2} - e^{-b/2}.$$

The theory will also suggest that, under certain assumptions about the
waiting times between calls, a histogram based upon thousands of waiting times
(using a finer partition of the x-axis than is indicated in the figure above)
should fit the curve on the right quite well. Also, the relative frequency
of the observed waiting times falling in a particular interval $[a,b]$ should
be close to the preassigned probability $P([a,b])$.
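As a numerical illustration of assigning probabilities by areas under $f(x) = (1/2)e^{-x/2}$, the sketch below compares the closed form $e^{-a/2} - e^{-b/2}$ with a midpoint-rule approximation of the integral. The function names and the number of integration steps are assumptions made for this example.

```python
import math


def density(x, mean=2.0):
    """f(x) = (1/mean) * exp(-x/mean) for x >= 0, and 0 for x < 0."""
    return math.exp(-x / mean) / mean if x >= 0 else 0.0


def prob_interval(a, b, mean=2.0):
    """Closed form of the area under f between a and b (0 <= a <= b)."""
    return math.exp(-a / mean) - math.exp(-b / mean)


def prob_interval_numeric(a, b, steps=10_000, mean=2.0):
    """Midpoint-rule approximation of the same area."""
    width = (b - a) / steps
    return width * sum(density(a + (i + 0.5) * width, mean) for i in range(steps))


print(round(prob_interval(0, 1), 4))          # waiting time under one minute
print(round(prob_interval_numeric(0, 1), 4))  # numerical check of the same area
```

The model gives a probability of roughly 0.39 for a waiting time under one minute, which is in the neighborhood of the 42 out of 100 depicted in the histogram.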

As in the spinner example, the probability of any countable union of
disjoint subintervals of S can be computed by adding the probabilities
of the individual intervals. As before, technical difficulties preclude
assigning probabilities to all subsets of S, but we can again restrict
ourselves to the smallest class of events that contains the intervals and is
closed under countable set operations (unions, intersections, and
complements). It can be shown that any probability measure on this class of
sets is completely determined by its values on the intervals. Thus the
function f above completely specifies the assignment of probabilities to
this class of sets through the relationship

$$P([a,b]) = \int_a^b f(x)\,dx.$$

The function f is an example of a density function, i.e., a nonnegative
function whose integral over the real line is equal to one. Clearly, any
density function can be used to specify a probability measure on the line,
and it is often convenient in applications of probability to use density
functions in specifying probability measures (or "distributions") on the
line.


The smallest class of subsets of the line that contains the intervals
and is closed under countable set operations is often referred to as the
class of Borel sets of the line.
SECTION III. - CONDITIONAL PROBABILITY AND INDEPENDENCE

References:

    Paul L. Meyer, Introductory Probability and Statistical
    Applications, 2nd Edition, Addison-Wesley, 1970, Chapter 3.

    Seymour Lipschutz, Theory and Problems of Probability,
    Schaum's Outline Series, McGraw-Hill, New York, 1968,
    Chapter 4.

    Emanuel Parzen, Modern Probability Theory and Its
    Applications, Wiley, 1960, Chapters 2 and 3.

    William Feller, An Introduction to Probability Theory and
    Its Applications, Vol. I, 3rd Edition, Wiley, 1968,
    Chapter 5.

    Paul E. Pfeiffer, Concepts of Probability Theory, McGraw-Hill,
    New York, 1965, pp. 41-105.

Consider choosing a person at random from a population of $N$ voters
of whom $N_F$ are female and $N_C$ are planning to vote for Charles Charmer.
Let C be the event that the person plans to vote for Charmer and F
the event that the person is female. Then

$$P(C) = \frac{N_C}{N} \quad \text{and} \quad P(F) = \frac{N_F}{N}.$$

Now suppose that we are informed that the person chosen was a woman. This
eliminates many sample points as possible outcomes of the experiment, and
it may not be the case that the proportion of women favoring Charmer is the
same as the corresponding proportion $P(C)$ for the entire population. If
in fact $N_{CF}$ women plan to vote for Charmer, then our revised assessment
of the probability that the person chosen will vote for Charmer is $N_{CF}/N_F$.
This ratio is called the conditional probability of C given F and is
denoted by $P(C|F)$. If it happens that $P(C) = P(C|F)$, so that knowing
that the event F occurred does not change our assessment of the
probability of C, then the events C and F are said to be independent.
These concepts are defined for arbitrary sample spaces below.


Conditional  Probability 


For any two events A and B such that $P(B) > 0$, the conditional
probability of A given B is defined by

$$P(A|B) = P(A \cap B)/P(B).$$

Note that, for fixed B, the conditional probability $P(A|B)$ is
proportional to $P(A \cap B)$ with the constant of proportionality chosen to
make $P(B|B) = 1$.


In a finite probability model $S = \{s_1, \ldots, s_n\}$ with equally
likely points, the probability of any event C is $\#(C)/n$ where $\#(C)$
denotes the number of points in C. Therefore

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{\#(A \cap B)/n}{\#(B)/n} = \frac{\#(A \cap B)}{\#(B)},$$

so that in this case $P(A|B)$ is the proportion of the points in B that
also belong to A. In general, $P(A|B)$ is the proportion of the
probability assigned to B that also belongs to A.

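It follows immediately from the definition of $P(B|A)$ that

$$P(A \cap B) = P(A)\,P(B|A).$$

More generally, if $A_1, A_2, \ldots, A_n$ are any events for which
$P(A_1 \cap A_2 \cap \cdots \cap A_{n-1}) > 0$, then

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap \cdots \cap A_{n-1}).$$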

These  results  are  sometimes  useful  in  computing  probabilities  of  joint 
occurrences  of  events  when  it  is  obvious  what  the  conditional  probabilities 
must  be  by  reference  to  the  reduced  sample  spaces. 


Exercises. 1. Two fair dice are thrown, one red and one green. What is
the conditional probability that the sum is ten or more given that (a) an
observer reported that the red die turned up as a five? (b) a colorblind
observer has reported that one of the dice turned up a five (not
intending to exclude the possibility that both turned up fives)? Ans. (a) 1/3,
(b) 3/11.

2. Consider drawing two balls at random without replacement from
an urn containing six numbered balls where balls 1 to 4 are white and
5 and 6 are red. Let A be the event that the first ball drawn is white
and B the event that the second ball drawn is white. Is it not obvious
from the physical situation that $P(B|A) = 3/5$? Is it equally obvious
that $P(A|B) = 3/5$? Do you believe that $P(A) = P(B)$? Set up a sample
space for this experiment with equally likely outcomes and verify your
answers.

3. A batch of 10 light bulbs contains three defectives. Bulbs are
selected at random without replacement and tested one by one. Find the
probability that the second defective occurs on the sixth draw. Ans. 1/6.
Hint: Let A be the event that there is exactly one defective in the
first five draws and B the event that there is a defective on the sixth
draw. Evaluate $P(A \cap B)$ using conditional probabilities.

4. Let Q be the set function defined on a class of events by
$Q(A) = P(A|B)$ where P is a probability measure and B is an event
for which $P(B) > 0$. Show that Q is a probability measure, thus verifying
that conditional probabilities "act like" probabilities.
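The stated answers to Exercises 1 to 3 can be checked by enumeration in equally likely models, using $P(A|B) = \#(A \cap B)/\#(B)$ and, for Exercise 3, the multiplication rule $P(A \cap B) = P(A)P(B|A)$. This is only a verification sketch; the helper `conditional` and the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product
from math import comb


def conditional(event_a, event_b, space):
    """P(A|B) for equally likely points: #(A and B) / #(B)."""
    b_points = [s for s in space if event_b(s)]
    return Fraction(sum(1 for s in b_points if event_a(s)), len(b_points))


# Exercise 1: sample points are (red, green).
dice = list(product(range(1, 7), repeat=2))
ex1a = conditional(lambda s: sum(s) >= 10, lambda s: s[0] == 5, dice)
ex1b = conditional(lambda s: sum(s) >= 10, lambda s: 5 in s, dice)

# Exercise 2: ordered draws without replacement; balls 1 to 4 are white.
urn = list(permutations(range(1, 7), 2))
ex2_b_given_a = conditional(lambda s: s[1] <= 4, lambda s: s[0] <= 4, urn)
ex2_a_given_b = conditional(lambda s: s[0] <= 4, lambda s: s[1] <= 4, urn)

# Exercise 3: A = exactly one defective in the first five draws,
# B = defective on the sixth draw (2 of the 5 remaining bulbs are defective).
p_a = Fraction(comb(3, 1) * comb(7, 4), comb(10, 5))
ex3 = p_a * Fraction(2, 5)

print(ex1a, ex1b, ex2_b_given_a, ex2_a_given_b, ex3)
```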

Bayes' Theorem

A partition of a sample space is a set of disjoint events
$B_1, B_2, \ldots$ such that their union is the entire sample space S. For
example, any event B and its complement $B^c$ constitute a partition.
If the sample space corresponds to some population, then any stratification
of that population, say by race, income level, or sex, constitutes a
partition of S.

The following result, the second part of which is called Bayes'
Theorem, is easily proved.

Theorem 3-1. Let $B_1, B_2, \ldots$ be a partition of S such that
$P(B_i) > 0$ for each $i$. Then for any event A

(i) $$P(A) = \sum_j P(A \cap B_j) = \sum_j P(A|B_j)P(B_j);$$

(ii) if $P(A) > 0$,

$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_j P(A|B_j)P(B_j)}.$$
Example. Suppose 20% of the people in a certain group are bad
drivers. Of these, 40% drive sports cars. Of the good drivers, 5%
drive sports cars. If you pick a person at random and he drives a sports
car, what is the probability that he is a bad driver?

Let V, B, and G denote the events corresponding to sports car
drivers, bad drivers, and good drivers in a sample space S that
corresponds to the population of interest. Then

$$P(V) = P(V|B)P(B) + P(V|G)P(G) = (.4)(.2) + (.05)(.8) = .12.$$

Thus,

$$P(B|V) = \frac{P(V|B)P(B)}{P(V)} = \frac{(.4)(.2)}{.12} = \frac{2}{3}.$$
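The two parts of Theorem 3-1 translate directly into two small functions. The sketch below reruns the sports-car example with exact fractions; the dictionary layout and the function names are illustrative assumptions.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """Part (i): P(A) = sum over j of P(A|B_j) P(B_j)."""
    return sum(priors[cell] * likelihoods[cell] for cell in priors)


def posterior(priors, likelihoods, cell):
    """Part (ii): P(B_i|A) = P(A|B_i) P(B_i) / P(A)."""
    return priors[cell] * likelihoods[cell] / total_probability(priors, likelihoods)


# Partition of the population into bad and good drivers.
priors = {"bad": Fraction(20, 100), "good": Fraction(80, 100)}
# P(drives a sports car | type of driver).
likelihoods = {"bad": Fraction(40, 100), "good": Fraction(5, 100)}

print(float(total_probability(priors, likelihoods)))  # P(V)
print(posterior(priors, likelihoods, "bad"))           # P(B|V)
```

The same two functions apply to any finite partition; only the two dictionaries change.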


Exercises.  1.  Prove  the  theorem  above. 

2. A plant produces three grades of components: 20% of all
components produced are of grade A, 30% of grade B, and 50% of grade C. The
percentage of defective components in the three grades are 5, 4, and 2
percent respectively. (a) What proportion of all components produced in
the plant are defective? (b) If a component selected at random from
the plant's output is defective, what is the probability that it is of
grade A? Ans. (a) 0.032, (b) 5/16.

3.  A certain  disease  is  present  in  about  one  out  of  1000  persons 
in  a certain  population.  A test  for  the  disease  exists  which  gives  a 
"positive"  reading  for  95%  of  the  victims  of  the  disease,  but  it  also 
gives  positive  readings  for  1%  of  those  who  do  not  have  the  disease. 
What  proportion  of  the  persons  who  have  positive  readings  actually  have 
the  disease?  Ans.  0.087. 

Independent Events, Independent Experiments, and Bernoulli Trials

Two events A and B are said to be independent if

$$P(A \cap B) = P(A)P(B).$$




If $P(B) > 0$, this condition is clearly equivalent to having $P(A|B) = P(A)$.
Thus, A and B are independent if and only if knowing that B has
occurred does not change the probability that A will occur. In the
equally likely outcome case, two events A and B are independent if
the proportion of the points in B that also belong to A is the same
as the proportion of points in the entire sample space that belong to A.

Three or more events $A_1, A_2, \ldots, A_n$ are said to be independent if
for any subsequence of $k$ integers $i_1 < i_2 < \cdots < i_k$ from 1 to $n$

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1})P(A_{i_2}) \cdots P(A_{i_k}).$$

In particular, three events A, B, and C are independent if the
following four conditions hold:

    $P(A \cap B) = P(A)P(B)$

    $P(A \cap C) = P(A)P(C)$

    $P(B \cap C) = P(B)P(C)$

    $P(A \cap B \cap C) = P(A)P(B)P(C)$.

Example. Referring back to the probability model for throwing two
fair dice, one can readily check that any two of the three events
A = "3 on the green die," B = "4 on the red die," and C = "total of seven"
are (pairwise) independent. However, it is not the case that
$P(A \cap B \cap C) = P(A)P(B)P(C)$, because $P(A \cap B \cap C) = 1/36$
whereas $P(A)P(B)P(C) = (1/6)^3 = 1/216$. Hence, these three events are
not independent.
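The claims in this example are easy to confirm by listing the 36 equally likely points. A minimal sketch follows; the ordering (green, red) of each point and the helper names are assumptions of the example.

```python
from fractions import Fraction
from itertools import combinations, product

space = list(product(range(1, 7), repeat=2))  # points are (green, red)


def prob(event):
    """Probability of an event in the equally likely two-dice model."""
    return Fraction(sum(1 for s in space if event(s)), len(space))


def both(e, f):
    """The intersection of two events."""
    return lambda s: e(s) and f(s)


def A(s): return s[0] == 3          # 3 on the green die
def B(s): return s[1] == 4          # 4 on the red die
def C(s): return s[0] + s[1] == 7   # total of seven


pairwise_ok = all(
    prob(both(e, f)) == prob(e) * prob(f) for e, f in combinations([A, B, C], 2)
)
triple = prob(both(A, both(B, C)))
print(pairwise_ok, triple, prob(A) * prob(B) * prob(C))
```

Pairwise independence holds for all three pairs, while the triple intersection has probability 1/36 rather than 1/216.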

The probability model for tossing two fair dice is an instance of
a model for two "independent experiments." Let $S_1 = \{s_1, s_2, \ldots\}$ and
$S_2 = \{t_1, t_2, \ldots\}$ be discrete sample spaces for two experiments, and let
$P_1$ and $P_2$ be the corresponding probability measures for the separate
experiments. (In the dice-throwing example, both $S_1$ and $S_2$ consist of the
integers from 1 to 6, and both probability measures $P_i$ assign
probability 1/6 to each point in $S_i$.) Then a sample space for the combined
experiment is

$$S = S_1 \times S_2 = \{(s,t) : s \in S_1,\ t \in S_2\}.$$

The two experiments are said to be independent if probabilities are
assigned to the points of S using the formula:

$$P\{(s,t)\} = P_1\{s\}\,P_2\{t\}.$$

To see the connection between independent experiments and independent
events, let A be any event in the combined sample space S that depends
on the outcome of the first experiment only (e.g., "3 or more on the red die").
Then A is of the form $C \times S_2 = \{(s,t) : s \in C\}$ where C is an event
in $S_1$ (e.g., $C = \{3,4,5,6\}$). Similarly, let $B = S_1 \times D$ be any event
that depends on the outcome of the second experiment only (e.g., "2 on the
green die"). Then it is easily verified that

$$P(A \cap B) = P(C \times D) = P_1(C)\,P_2(D) = P(A)P(B).$$

Thus, if probabilities are defined multiplicatively on S using the rule
indicated above, any event that depends on the outcome of the first
experiment only is independent of any event that depends on the outcome of the
second experiment only.

To extend the notion of independent experiments to more general
sample spaces, one is led by the discussion above for discrete sample
spaces to proceed as follows. Let $S_1$ and $S_2$ be any two sample spaces
with probability measures $P_1$ and $P_2$. If C is any event in $S_1$ and D
is any event in $S_2$, define the probability of the "rectangle" $C \times D$ in the
product space $S = S_1 \times S_2$ by

$$P(C \times D) = P_1(C)\,P_2(D).$$

It follows from this definition that any event $A = C \times S_2$ that depends
on the outcome of the first experiment only is independent of any event
$B = S_1 \times D$ that depends on the result of the second experiment only,
since

$$P(A \cap B) = P(C \times D) = P_1(C)P_2(D) = [P_1(C)P_2(S_2)]\,[P_1(S_1)P_2(D)] = P(A)P(B).$$

More generally, one can combine the sample spaces $S_1, S_2, \ldots, S_n$
for $n$ separate experiments and define probabilities multiplicatively on
the product space $S = S_1 \times S_2 \times \cdots \times S_n$ to provide a model
for $n$ independent experiments. It will then follow that, if
$A_1, A_2, \ldots, A_n$ are events such that $A_i$ depends on the result of
the $i$th experiment only, these events are independent.
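The multiplicative construction can be tried out on two small discrete experiments whose points are not equally likely. In the sketch below the two component measures, the events C and D, and all names are hypothetical; the code builds the product measure point by point and confirms that $A = C \times S_2$ and $B = S_1 \times D$ are independent.

```python
from fractions import Fraction
from itertools import product

# Two hypothetical experiments with unequal point probabilities.
P1 = {0: Fraction(1, 4), 1: Fraction(3, 4)}
P2 = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}


def product_measure(first, second):
    """Assign P{(s,t)} = P1{s} * P2{t} to every point of the product space."""
    return {(s, t): first[s] * second[t] for s, t in product(first, second)}


def prob(measure, event):
    """Probability of an event: the sum of the probabilities of its points."""
    return sum(p for point, p in measure.items() if event(point))


P = product_measure(P1, P2)
C, D = {1}, {"a", "b"}


def in_A(point): return point[0] in C   # depends on the first experiment only
def in_B(point): return point[1] in D   # depends on the second experiment only


p_a, p_b = prob(P, in_A), prob(P, in_B)
p_ab = prob(P, lambda point: in_A(point) and in_B(point))
print(p_a, p_b, p_ab)
```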

For example, consider $n$ trials of exactly the same type (e.g.,
repeated tosses of a coin, or successive draws at random with
replacement from a population) where each trial results in one of two outcomes
of interest, say 1 and 0 (for success or failure, or heads and tails,
or employed and unemployed), with probabilities $p$ and $q = 1 - p$ on
each trial. Such trials are called Bernoulli (or binomial) trials.

A probability model for $n$ Bernoulli trials is prescribed by taking
the sample space $S = \{(x_1, \ldots, x_n) : x_i = 1 \text{ or } 0\}$ and assigning
probabilities, for example, as follows:

$$P\{(1,1,0,1,\ldots,0)\} = ppqp \cdots q.$$

To see how to compute probabilities of certain events of interest,
consider the event $A_3$ that exactly three of the $n$ trials result in
successes. Then $A_3$ consists of all sample points in S that have
exactly three 1's. Since the probability assigned to any such point
is $p^3 q^{n-3}$, it follows that $P(A_3) = \#(A_3)\,p^3 q^{n-3}$ where $\#(A_3)$ is the
number of points in $A_3$. But the number of points in $A_3$ is clearly
the number of ways of choosing three of the $n$ components for the 1's.
That is,

$$\#(A_3) = \binom{n}{3} = \frac{n!}{3!\,(n-3)!}.$$

In particular, if $n = 4$, the number of points in $A_3$ is $4!/3!\,1! = 4$,
namely, (1,1,1,0), (1,1,0,1), (1,0,1,1), and (0,1,1,1).

Similarly, if $A_k$ is the event that there are exactly $k$ successes
in $n$ Bernoulli trials, then

$$P(A_k) = \binom{n}{k} p^k q^{n-k} \quad \text{for } k = 0,1,\ldots,n.$$

For example, the probability of $n$ successes is $p^n$, the probability
of $n$ failures is $q^n$, and the probability of at least one success is
$1 - q^n$.
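The passage from point probabilities to the formula for $P(A_k)$ can be mirrored in code: assign $p^{\#1's} q^{\#0's}$ to each point of $S$, add over the points with exactly $k$ ones, and compare with $\binom{n}{k}p^k q^{n-k}$. The function names and the choice $n = 5$, $p = 1/3$ are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import comb


def point_probability(point, p):
    """Probability of one sequence of 1's and 0's: p per success, q per failure."""
    ones = sum(point)
    return p ** ones * (1 - p) ** (len(point) - ones)


def prob_k_successes(n, k, p):
    """P(A_k) by summing over all sample points with exactly k ones."""
    points = (pt for pt in product((1, 0), repeat=n) if sum(pt) == k)
    return sum(point_probability(pt, p) for pt in points)


def binomial(n, k, p):
    """The closed form C(n,k) p^k q^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


p = Fraction(1, 3)
for k in range(6):
    print(k, prob_k_successes(5, k, p), binomial(5, k, p))

# The four points of A_3 when n = 4.
print([pt for pt in product((1, 0), repeat=4) if sum(pt) == 3])
```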
Exercises. 1. Find the probability that, if four fair coins are
tossed, (a) all will turn up heads, (b) three will turn up heads.
Ans. (a) 1/16, (b) 1/4.

2. Balls are drawn at random with replacement from an urn
containing 1/3 red balls and the rest white. Find the probability that
(a) five successive draws will yield two red balls, then three white
balls, (b) there are exactly two red balls in the five draws, (c) there
are at least two red balls in five draws. Ans. (a) 8/243, (b) 80/243,
(c) 131/243.

3. If only 25% of the voters favor a certain candidate, what is
the probability that a random sample of size 10 will show 8 or more
favoring him? Ans. $436/4^{10} \approx 0.0004$.


SECTION IV. - RANDOM VARIABLES AND THEIR DISTRIBUTIONS

References:

    Paul L. Meyer, Introductory Probability and Statistical
    Applications, 2nd Edition, Addison-Wesley, 1970, Chapter 4.

    Seymour Lipschutz, Theory and Problems of Probability,
    Schaum's Outline Series, McGraw-Hill, New York, 1968,
    Chapter 5.

    Paul E. Pfeiffer, Concepts of Probability Theory, McGraw-Hill,
    New York, 1965, Chapter 3.


Consider the dice-throwing example again, where the sample space
chosen was $S = \{(x,y) : x,y \in \{1,2,\ldots,6\}\}$. In the game of "craps,"
one is not interested in the particular outcome $(x,y)$ that occurs,
because only the sum is relevant. This leads us to consider the "random
variable" Z on S defined for all points $(x,y)$ by $Z(x,y) = x + y$.
In general, a random variable is a real-valued function defined on a
sample space.* Roughly speaking, the key idea behind the notion of a
random variable is that it is a variable that depends on the result of
a random experiment; its value for a particular outcome of an experiment
is a number computed from the data point.


*This definition suffices for discrete sample spaces, where all
subsets of S are events, and for the applications of probability models
to be considered in this course. For arbitrary sample spaces, in which
not all subsets are events, probabilists prefer to define a random
variable X as a real-valued function on S such that the subset
$\{s : X(s) \le c\}$ is an event for every real number $c$. The purpose of
this additional restriction is to assure that, under certain reasonable
assumptions on the class of events, probabilities of the form $P(X \le c)$,
$P(X < c)$, and $P(a < X < b)$ are all defined for any random variable X,
as well as any probabilities of the form $P(X \in B)$ where B is a
countable union of intervals (open, half-open, or closed) on the line. For
our purposes, we can consign this bit of pedantry to a footnote and refer
the mathematically oriented reader to books on probability theory, e.g.,
the book by Pfeiffer cited above.
Some other random variables on the same sample space are:

    $X(x,y) = x$
    $Y(x,y) = y$

and the two-valued random variable that takes one value if $x + y = 7$ or 11
and another value otherwise.


Note that we have used capital letters X, Y, and Z to denote random
variables rather than the usual function notation of calculus (e.g.,
f, g, h). This usage has become traditional in probability and statistics
to distinguish the random variables from their values, which in turn are
often denoted in lower-case letters.

Sometimes random variables are defined implicitly as functions of
other random variables. For example, Z could have been defined above
using usual function notation as $Z = X + Y$.

Ordinarily random variables are defined verbally rather than explicitly
using function notation. Thus, one might refer to the number of successes
X in $n$ Bernoulli trials. Relative to the sample space S at the end
of the previous section, this means that for any sample point
$s = (x_1, x_2, \ldots, x_n)$ consisting of 1's and 0's, $X(s) =$ (number of 1's
in $s$). Note that if $X_j$ denotes the result of the $j$th trial (i.e., $X_j(s) = x_j$),
then $X = \sum_{j=1}^{n} X_j$. This illustrates how a random variable can sometimes be
represented as a function of other random variables of a simpler nature.
Here, each $X_j$ has only two possible values 0 and 1. The utility of such
representations will be exhibited later.

The following examples of random variables refer to problems
discussed in Section II.

1. Hatcheck girl problem.

Let S be the set of the 4! permutations of the integers 1,2,3,4,
namely, (1,2,3,4), (2,1,3,4), etc. The point (2,4,3,1), for example,
corresponds to the outcome that the first man receives the second man's
hat, the second man receives the fourth man's hat, the third man receives
his own hat, and the fourth man receives the first man's hat. Let X
be the random variable corresponding to the number of hats returned
correctly, so that $X(2,4,3,1) = 1$, $X(1,2,3,4) = 4$, etc.
2. Spinner problem.

The sample space chosen to correspond to the set of possible
readings of the spinner was the unit interval [0,1].

(a) Let X be the number chosen at random:
$X(s) = s$ for all $s$.

(b) $Y(s) = \sin 2\pi s$. [No one said that random variables had to
be of particular interest for the experiment under consideration. This
one happens to be of interest in another context, that of choosing a
direction at random, specified by a point $(\cos 2\pi s, \sin 2\pi s)$ on the unit
circle.]

(c) $X^2(s) = s^2$. [Note the strange, but unambiguous, notation.]

(d) $Z(s)$ takes one constant value if $0 \le s \le 1/4$ and another
constant value if $s > 1/4$.


3. Telephone problem.

(a) Let X be the waiting time in minutes until a telephone
call comes into the exchange, i.e., $X(s) = s$ for all $s > 0$.

(b) $Y = X/60$, the corresponding waiting time in hours.

(c) $Z =$ integral part of X. For example, if $X(s) = 6.875$
(minutes), then $Z = 6$.

Just as a random variable X "maps" (or "carries") sample points
from S into the real number line, it also carries probabilities on S
into the real line R, inducing a probability measure on R that is called
the distribution of the random variable X. As we shall see, distributions
of random variables play a central role in statistical theory.

To get a feeling for the notion of a distribution of a random variable,
let us return once again to the dice-throwing example and consider the sum
of the outcomes on the two dice, $Z(x,y) = x + y$. The figure on the next
page attempts to depict the way that the random variable Z maps points
in S into R and thereby induces a probability distribution on R.

The top part of the figure indicates the correspondence between events
in S and the possible values of Z: 2, 3, ..., 12. Since Z has value
4 on the event {(3,1), (2,2), (1,3)}, and this event has probability
$P(Z = 4) = 3/36 = 1/12$, the number 4 receives probability 1/12 under
the distribution induced by Z. The function depicted in the bottom
half of the figure indicates the probabilities assigned to the other
values of Z. This is a graph of the "probability function" of Z, one
method of characterizing the distribution of a "discrete" random variable.


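[Figure IV-1. Top: the random variable Z carrying the points of the sample
space S to the values 2, 3, ..., 12. Bottom: graph of the probability
function p(z).]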
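The way a random variable carries probabilities from S to the line can be imitated directly: group the sample points by their value under Z and add up their probabilities. The sketch below assumes the equally likely two-dice model; the function name `distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def distribution(space, random_variable):
    """Probability function induced on the line by a random variable
    defined on a finite sample space with equally likely points."""
    counts = Counter(random_variable(s) for s in space)
    return {value: Fraction(c, len(space)) for value, c in sorted(counts.items())}


dice = list(product(range(1, 7), repeat=2))
p_Z = distribution(dice, lambda s: s[0] + s[1])

for z, prob in p_Z.items():
    print(z, prob)
```

The printed values are the heights of the spikes in the lower half of Figure IV-1.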
In general, a random variable X is said to be discrete if there
is a countable set of real numbers, say $A = \{x_1, x_2, \ldots\}$, such that
$P(X \in A) = 1$. In this case, the function $p$ on A defined by

$$p(x) = P(X = x)$$

is called the probability function of X. Some obvious properties of the
probability function are:

(a) $p(x) \ge 0$ for all $x$ in A,

(b) $\sum_{x \in A} p(x) = 1$.

The probability function of the random variable Z in the
dice-throwing example was depicted at the bottom of Figure IV-1. As a second
example, let X be the number of hats returned correctly in the hatcheck
girl problem. As was seen in an exercise in Section II,

$$p(2) = P(X = 2) = 1/4.$$

Other values of the probability function of X are given below.

     x      p(x)

     0      3/8
     1      1/3
     2      1/4
     3      0
     4      1/24
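The table can be recomputed by listing the 4! equally likely permutations and counting, for each, the number of men who receive their own hats. This is a verification sketch; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def hats_returned_correctly(perm):
    """X(perm): number of positions j (1-based) where man j gets hat j."""
    return sum(1 for man, hat in enumerate(perm, start=1) if man == hat)


points = list(permutations(range(1, 5)))
counts = Counter(hats_returned_correctly(perm) for perm in points)

# Values that never occur (here x = 3) get probability zero.
p = {x: Fraction(counts.get(x, 0), len(points)) for x in range(5)}
for x, prob in p.items():
    print(x, prob)
```

The value 3 is impossible because, once three men have their own hats, the fourth must as well.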

Clearly, any random variable on a sample space that has only countably
many points must be discrete. As an example of a discrete random variable
with infinitely many values, consider the waiting time for heads in
repeated independent tosses of a fair coin. Examples of discrete random
variables on uncountable sample spaces are given by Examples 2(d) and 3(c)
above. The other examples of random variables for the spinner and
telephone problems are not discrete, and since $P(X = x) = 0$ for all values
of $x$ for both the random variable X in the spinner problem and the
waiting time in the telephone problem, we shall require characterizations
other than the probability function to specify their distributions.
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The  distribution  function  (or  cumulative  distribution  function) 
of  a random  variable  X Is  defined  for  all  real  x by 
F(x)  - P(X  £ x). 

The  Importance  of  the  distribution  function  of  X Is  that  It  provides 
a simple  characterization  and  description  of  the  distribution  of  X, 
whether  X Is  discrete  or  not. 

Examples. 1. Let X be the random variable in the hatcheck
girl problem. The distribution function F of X is a step function.
[Graph of F omitted.]

Comparing this graph with that of the probability function above, we note
that the distribution function has jumps at 0, 1, 2, and 4, the values
which X takes on with positive probabilities.


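2. If X is the random variable in the spinner problem, then

$$F(x) = P(X \le x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x \le 1 \\ 1 & \text{if } x > 1. \end{cases}$$

3. Let X be the random variable in the telephone problem. Then

$$F(x) = \int_{-\infty}^{x} f(t)\,dt \quad \text{where} \quad f(t) = \begin{cases} 0 & \text{if } t < 0 \\ \tfrac{1}{2}\,e^{-t/2} & \text{if } t \ge 0, \end{cases}$$

so that

$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x/2} & \text{if } x \ge 0. \end{cases}$$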


Although the three distribution functions above are quite different
in nature, they share a number of common properties. In general, the
distribution function F of a random variable X must satisfy the
following properties:

(a) $0 \le F(x) \le 1$ for all real numbers x.

(b) F is monotonically increasing, i.e., if a < b, then $F(a) \le F(b)$.

(c) $F(-\infty) = 0$, $F(\infty) = 1$.

(d) F is right continuous, i.e., F(x + 0) = F(x) for all x [here,
F(x + 0) denotes lim F(y) as y tends to x from above].

(e) $P(a < X \le b) = F(b) - F(a)$.

(f) P(X = b) = F(b) - F(b - 0) [this is the jump in F at b].

If X is discrete and has probability function p, then

$$F(x) = \sum_{x_i \le x} p(x_i).$$

In most instances, the probability function is preferable to the distribution
function in describing a particular discrete distribution. We now turn to
another class of distributions for which a characterization other than
the distribution function is usually preferable.

A random variable X with distribution function F is said to
have a continuous distribution if there is a nonnegative function f
on R (called the density function of X) such that

(1)    $F(x) = \int_{-\infty}^{x} f(t)\,dt$

for each real value of x.

Examples 2 and 3 above provide examples of random variables having
continuous distributions. The density function of the random variable X
in Example 2 is given by

$$f(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

The density function in Example 3 is clearly specified. Since

$$P(a < X \le b) = F(b) - F(a) = \int_a^b f(x)\,dx,$$

these probabilities can be visualized as areas under the curve f(x) and
between the ordinates x = a and x = b, as was illustrated earlier in
Section II.

It follows from (1) above that, if X has a continuous distribution,
then its distribution function F is continuous. However, the converse
of this statement is not true since there are continuous distribution
functions F for which no density function f exists. (An attempt to
depict such a function F is given on page 193 in Introduction to Measure
and Integration by M. E. Munroe.) Therefore some writers prefer to say
that X has an absolutely continuous distribution when (1) holds.

Some observations which follow from (1) are:

(a) $\frac{d}{dx} F(x) = f(x)$ at every continuity point x of f;

(b) $\int_{-\infty}^{\infty} f(x)\,dx = 1$;

(c) If X has a continuous distribution, then
P(X = x) = F(x) - F(x - 0) = 0
for every real value of x.

To indicate an application that gives rise to a random variable
whose distribution is neither continuous nor discrete, consider
measuring the lifetime of a lightbulb, where it is reasonable to
assume that there is a nonzero probability that the bulb will not burn
at all. A distribution function like the one pictured below might be
appropriate in this situation. [Figure omitted.]


Exercises. 1. Five balls are chosen at random from an urn containing
9 balls of which 3 are white. Let X be the number of white balls
in the sample. Find and sketch the probability function of X if the
balls are chosen (a) with replacement, (b) without replacement.

Ans. (a) 32/243, 80/243, 80/243, 40/243, 10/243, 1/243.

(b) 1/21, 5/14, 10/21, 5/42, 0, 0.

2. Suppose Y has a density function of the form
f(y) = cy for 0 < y < 1.

(a) What is the value of c?

(b) Find P(Y < 1/2).

(c) Find and sketch the distribution function of Y.

(d) Find and sketch the density function of U = 3Y. [Note that
$P(U \le u) = P(Y \le u/3)$.]

(e) Find and sketch the density function of V = Y + 1.

Ans. (a) 2, (b) 1/4, (c) F(y) = 0 for $y \le 0$, $y^2$ for 0 < y < 1, 1 for $y \ge 1$,
(d) 2u/9 for 0 < u < 3, (e) 2(v - 1) for 1 < v < 2.

3. Let $Y = \sqrt{X}$ where X is the random variable in the spinner
problem.

(a) Find P(Y < 1/2).

(b) Find P(Y < 1/2 | X < 3/4).

(c) Find and sketch the distribution function of Y.

(d) Find and sketch the density function of Y.

Ans. (a) 1/4, (b) 1/3, (c) same as 2(c), (d) 2y for 0 < y < 1.

4. Let X be the random variable in the telephone problem. Show
that P(X > a + b | X > a) = P(X > b) for all positive values of a and b.

SECTION V. - CHARACTERISTICS OF DISTRIBUTIONS

References:

Paul L. Meyer, Introductory Probability and Statistical
Applications, 2nd Edition, Addison-Wesley, 1970, Chapter 7.

Seymour Lipschutz, Theory and Problems of Probability,
Schaum's Outline Series, McGraw-Hill, New York, 1968, Chapter 5.

Paul E. Pfeiffer, Concepts of Probability Theory, McGraw-Hill,
New York, 1965, Chapter 5.

Consider the experiment of drawing a tag at random from a box
containing N tags of which 1/2 are marked "1," 1/3 are marked "2," and
1/6 are marked "3." Let X be the number on the tag that is drawn. With
an appropriate sample space for this experiment consisting of N equally
likely outcomes, X is a random variable having probability function

    x        1      2      3

    p(x)    1/2    1/3    1/6

The "expected value" of X, denoted by E(X), will be defined below as a
weighted average of the possible values of X using the probabilities
p(x) as weights. In this case,

    E(X) = 1(1/2) + 2(1/3) + 3(1/6) = 5/3.

Before proceeding with a formal definition, we note two interpretations
of E(X) in this example. First, the arithmetic average (mean) of
all the numbers on the N tags in the box is

$$\frac{1(N/2) + 2(N/3) + 3(N/6)}{N} = 1(1/2) + 2(1/3) + 3(1/6) = 5/3.$$

Thus, in this case E(X) = 5/3 coincides with the ordinary average of the
tag numbers in the box. Next, suppose we repeat the experiment independently
a large number of times, say n, and let $n_1$, $n_2$, $n_3$ be the numbers of
times that tags numbered 1, 2, and 3 are drawn. Then the average of the
numbers drawn on the n trials is

$$\frac{1 n_1 + 2 n_2 + 3 n_3}{n} = 1(n_1/n) + 2(n_2/n) + 3(n_3/n).$$

In a large number of trials, we would anticipate that the sample
proportions $n_1/n$, $n_2/n$, and $n_3/n$ would be close to the probabilities 1/2, 1/3,
and 1/6. Therefore, we would expect that the average of the numbers drawn
would be close to E(X) = 5/3. The validity of this second interpretation
of E(X) will be established later.
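The second interpretation can be illustrated (not proved) by simulation. The sketch below assumes that drawing a tag is modeled by Python's pseudo-random generator with weights 1/2, 1/3, 1/6; the function name `average_of_draws` and the seed are arbitrary choices made here.

```python
import random

def average_of_draws(n, seed=0):
    """Average tag number over n independent simulated draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    tags = [1, 2, 3]
    weights = [1 / 2, 1 / 3, 1 / 6]  # probabilities p(1), p(2), p(3)
    draws = rng.choices(tags, weights=weights, k=n)
    return sum(draws) / n

for n in (100, 10_000, 200_000):
    print(n, average_of_draws(n))
```

For large n the printed averages should settle near 5/3, although any single run can deviate by a small random amount.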

Definition. Let X be a discrete random variable having possible
values $x_1, x_2, \ldots$ and probability function p. Then the expected value
(expectation, mean) of X is defined by

$$E(X) = \sum_k x_k\,p(x_k)$$

provided that $\sum_k x_k\,p(x_k)$ converges absolutely. If $\sum_k |x_k|\,p(x_k)$ diverges,
we say that the expected value of X does not exist (or that the
expectation of X is infinite).

Examples.

1. A random variable X is said to have a Bernoulli distribution
with parameter p if P(X = 1) = p and P(X = 0) = q = 1 - p. In this
case,

    E(X) = 1·p + 0·q = p.

2. Suppose X has probability function $p(x_k) = 1/n$ where
$x_1, x_2, \ldots, x_n$ are n (distinct) real numbers. Then $E(X) = \sum_k x_k/n$.

3. Let X be the waiting time for a "1" if a fair die is tossed
repeatedly until "1" occurs for the first time. Then X has probability
function $p(x) = q^{x-1} p$ where p = 1/6 and q = 5/6. Therefore,
$E(X) = \sum_{x=1}^{\infty} x\,q^{x-1} p = 1/p = 6$. [In general, $\sum_{k=0}^{\infty} z^k = 1/(1 - z)$ for
|z| < 1; taking derivatives on both sides in this equation yields
$\sum_{k=1}^{\infty} k z^{k-1} = 1/(1 - z)^2$ for |z| < 1.]

4. An example of a discrete random variable that does not have
an expectation is provided by letting X be a random variable such that
$P(X = 2^n) = 1/2^n$ for n = 1, 2, .... In this case, each term in the
series $\sum_k x_k\,p(x_k)$ is equal to one, and hence the series does not converge.
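Examples 3 and 4 can be contrasted numerically by looking at partial sums. This is an illustrative sketch; the function names are hypothetical and the truncation points are arbitrary.

```python
from fractions import Fraction

p, q = Fraction(1, 6), Fraction(5, 6)

def partial_mean_geometric(terms):
    """Partial sum of x * q**(x-1) * p for x = 1..terms (Example 3)."""
    return sum(x * q ** (x - 1) * p for x in range(1, terms + 1))

def partial_sum_example4(terms):
    """Partial sum of 2**n * (1/2**n) for n = 1..terms (Example 4)."""
    return sum(Fraction(2 ** n) * Fraction(1, 2 ** n) for n in range(1, terms + 1))

for terms in (10, 50, 200):
    print(terms, float(partial_mean_geometric(terms)), partial_sum_example4(terms))
```

The first column of sums approaches 1/p = 6, while the second grows without bound because every term equals one.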
Exercises.

1. Let X be the number of heads that occur in three tosses of a
fair coin. Show that E(X) = 3/2.

2. Five balls are chosen at random from an urn containing 9 balls
of which 3 are white. Let X be the number of white balls in the sample.
Show that E(X) = 5/3 whether the sampling is done with or without
replacement. [You derived the probability function(s) of X in Exercise 1,
page 36.]

3. If two fair dice are tossed and Z is the sum of the results,
show that E(Z) = 7. (See page 31 for the probability function of Z.)
Now suppose that the two dice are colored red and green. Let X be the
result on the red die, and Y the result on the green die. Show that
E(X) = E(Y) = 7/2, thus verifying that E(X + Y) = E(X) + E(Y) in this case.

To derive some of the fundamental properties of expectation, let us
first restrict our attention to discrete sample spaces $S = \{s_1, s_2, \ldots\}$,
so that the random variables involved will necessarily be discrete.

For purposes of illustration, let X be a random variable on S having
only three possible values $x_1$, $x_2$, and $x_3$, and consider the
partition of the sample space into the sets $A_i = \{s : X(s) = x_i\}$.
Denoting the elements of $A_i$ by $s_{i1}, s_{i2}, \ldots$, we have that

$$
\begin{aligned}
E(X) &= \sum_i x_i\,p(x_i) \\
     &= x_1 P(A_1) + x_2 P(A_2) + x_3 P(A_3) \\
     &= x_1 (P\{s_{11}\} + P\{s_{12}\} + \cdots) \\
     &\quad + x_2 (P\{s_{21}\} + P\{s_{22}\} + \cdots) \\
     &\quad + x_3 (P\{s_{31}\} + P\{s_{32}\} + \cdots) \\
     &= \sum_{i,j} X(s_{ij})\,P\{s_{ij}\}.
\end{aligned}
$$

This shows that, in discrete probability models, our definition of E(X)
is equivalent to setting

$$E(X) = \sum_s X(s)\,P\{s\}.$$

This means that E(X) can also be interpreted as a weighted average of the
values of X at each of the sample points where the weights are the
probabilities P{s}.

One of the implications of this second representation is that, if
X and Y are any two random variables on S having finite expectations,
and if Z = X + Y, then E(Z) = E(X) + E(Y), because

$$E(Z) = \sum_s Z(s)P\{s\} = \sum_s [X(s) + Y(s)]P\{s\} = \sum_s X(s)P\{s\} + \sum_s Y(s)P\{s\} = E(X) + E(Y).$$

Also, if W = aX + b where a and b are any constants, then

$$E(W) = \sum_s W(s)P\{s\} = \sum_s [aX(s) + b]P\{s\} = a\sum_s X(s)P\{s\} + b\sum_s P\{s\} = aE(X) + b.$$

This motivates the following results, which are true for all probability
models, not just discrete ones. (Ref. Pfeiffer, Chapter 5.)

Theorem 5-1. If X and Y are any two random variables that have
finite expectations, then

(a) E(X + Y) = E(X) + E(Y), and

(b) E(aX + b) = aE(X) + b for any constants a and b.

Corollary. If $X_1, X_2, \ldots, X_n$ are n random variables having finite
expectations, then $E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n)$.

Example. A gambler at the "craps" tables in Las Vegas can place a
4-to-1 bet on the occurrence of "7" when two fair dice are tossed. If
he bets a dollar and "7" occurs, he wins $4; otherwise, he loses $1. Let
G be his gain in dollars on a single trial. Since the probability of winning
on each toss is 1/6, P(G = 4) = 1/6 and P(G = -1) = 5/6, so that

    E(G) = 4(1/6) - 1(5/6) = -1/6.

Alternatively, one could set X equal to 1 or 0 according as the result
is "7" or not, in which case G = 5X - 1 and

    E(G) = 5E(X) - 1 = 5(1/6) - 1 = -1/6.

Suppose that he bets a dollar on every toss of the dice for an hour where tosses
occur at a rate of two a minute, and let $G_j$ be his gain on the jth trial.
Then his expected overall gain on the 120 trials is

$$E\Big(\sum_{j=1}^{120} G_j\Big) = \sum_{j=1}^{120} E(G_j) = 120 \cdot (-1/6) = -20.$$
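The arithmetic in this example can be reproduced exactly. The sketch below assumes fair dice and the payoffs stated above; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Probability of a "7" with two fair dice, by enumeration.
outcomes = list(product(range(1, 7), repeat=2))
p_seven = Fraction(sum(1 for a, b in outcomes if a + b == 7), len(outcomes))

# Direct computation: gain is +4 with probability p_seven, -1 otherwise.
expected_gain = 4 * p_seven + (-1) * (1 - p_seven)

# Via linearity (Theorem 5-1): G = 5X - 1 with E(X) = p_seven.
expected_gain_linear = 5 * p_seven - 1

# Corollary: the expectation of a sum of 120 gains is the sum of expectations.
expected_total = 120 * expected_gain

print(p_seven, expected_gain, expected_gain_linear, expected_total)
```

Both routes give the same expected gain per toss, and the total over 120 tosses follows without any assumption about independence of the tosses.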

Another implication of the representation $E(X) = \sum_s X(s)\,P\{s\}$ for
discrete probability models is that, if Y = g(X) where g is some
real-valued function on R, then

$$
\begin{aligned}
E(Y) &= \sum_s Y(s)P\{s\} = \sum_s g(X(s))P\{s\} \\
     &= g(x_1)P\{s : X(s) = x_1\} + g(x_2)P\{s : X(s) = x_2\} + \cdots \\
     &= \sum_k g(x_k)\,p(x_k),
\end{aligned}
$$

where p is the probability function of X. That is, one can compute the
expectation of Y = g(X) without first deriving the probability function of Y.

Theorem 5-2. If X is a discrete random variable having probability
function p and if the expectation of Y = g(X) exists, then
$E(Y) = \sum_k g(x_k)\,p(x_k)$.
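Theorem 5-2 can be checked on a small case by computing E(Y) both ways. The example below is hypothetical (it is not from the notes): X is the result of one fair die and g(x) = (x - 3)^2.

```python
from collections import defaultdict
from fractions import Fraction

# Probability function of X: one toss of a fair die (illustrative choice).
p_X = {x: Fraction(1, 6) for x in range(1, 7)}

def g(x):
    return (x - 3) ** 2

# Route 1 (Theorem 5-2): weight g(x) by the probability function of X.
via_theorem = sum(g(x) * px for x, px in p_X.items())

# Route 2: first derive the probability function of Y = g(X), then apply
# the definition of expectation to Y.
p_Y = defaultdict(Fraction)
for x, px in p_X.items():
    p_Y[g(x)] += px
via_definition = sum(y * py for y, py in p_Y.items())

print(via_theorem, via_definition)
```

The second route has to merge the values of x that share the same g(x) (here x = 2 and 4, and x = 1 and 5); the theorem says this bookkeeping is unnecessary.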

The more general applicability of the theorems above becomes apparent
when two facts are observed. First, the expectation of a discrete random
variable X depends only on the probability function p of X and not
on the nature of the sample space upon which X is defined. Therefore, in
considering expectations of discrete random variables (or functions of
discrete random variables), there is no loss of generality in assuming that
the underlying sample space is discrete. Second, for any random variable
X on any sample space S there is a discrete random variable $X_n$ such
that $|X(s) - X_n(s)| \le 1/n$ for all sample points s, namely,

    $X_n(s) = j/n$ if $j/n < X(s) \le (j+1)/n$

where j is restricted to integer values.

Using this second observation, one is motivated to define the expectation
of any random variable X as the limit of the expectations of the
discrete random variables $X_n$, assuming that the limit exists. If X has
distribution function F, then $P(X_n = j/n) = F((j+1)/n) - F(j/n)$, so that

(1)    $E(X_n) = \sum_j \frac{j}{n}\left[F\left(\frac{j+1}{n}\right) - F\left(\frac{j}{n}\right)\right]$.

As the figure below indicates, as $n \to \infty$, the sum of the positive terms in
(1) tends to the area of the shaded portion to the right, and the sum of the
negative terms tends to the negative of the area of the
shaded portion to the left. [Figure omitted.] This provides a valid geometrical
interpretation of E(X) as the difference between the two shaded areas depicted.^1

^1 More precisely, $E(X) = \int_0^{\infty} [1 - F(x)]\,dx - \int_{-\infty}^{0} F(x)\,dx$. If X is a
nonnegative random variable, then the second term is zero, and
$E(X) = \int_0^{\infty} [1 - F(x)]\,dx = \int_0^{\infty} P(X > x)\,dx$.

Now suppose X is a continuous random variable having density
function f. Then, (1) above can be written as

(2)    $E(X_n) = \sum_j \frac{j}{n} \int_{j/n}^{(j+1)/n} f(x)\,dx = \int_{-\infty}^{\infty} g_n(x)\,f(x)\,dx$

where $g_n(x) = j/n$ if $j/n < x \le (j+1)/n$. Since $g_n(x) \to x$ as $n \to \infty$, it
follows that

$$E(X_n) \to \int_{-\infty}^{\infty} x\,f(x)\,dx$$

provided that $\int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty$. This motivates the following definition:

Definition. Let X be a continuous random variable having density function f.
Then the expectation of X is defined by

$$E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx$$

provided that $\int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty$.

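Examples.

1. Suppose Y has density f(y) = 2y for 0 < y < 1.
Then $E(Y) = \int_0^1 2y^2\,dy = 2/3$.

2. Let X be the waiting time in the telephone problem. (See page 34.)
Then X has density $f(x) = \lambda e^{-\lambda x}$ for x > 0 where $\lambda = 1/2$, and

$$E(X) = \int_0^{\infty} x\,\lambda e^{-\lambda x}\,dx = \Big[-x e^{-\lambda x}\Big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx = 1/\lambda = 2.$$

3. If Z has density $f(z) = 1/[\pi(1 + z^2)]$, then E(Z) does not exist
because $\int_{-\infty}^{\infty} |z|/[\pi(1 + z^2)]\,dz = \infty$.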
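The three examples can be checked numerically. The sketch below uses a plain midpoint rule written for this purpose (the helper `midpoint_integral`, the truncation point 100, and the step count are all choices made here, not part of the notes). It also evaluates the tail formula for nonnegative variables, E(X) = integral of P(X > x), for Example 2.

```python
import math

def midpoint_integral(h, a, b, steps=100_000):
    """Approximate the integral of h over [a, b] by the midpoint rule."""
    width = (b - a) / steps
    return sum(h(a + (i + 0.5) * width) for i in range(steps)) * width

# Example 1: density 2y on (0, 1).
mean_example1 = midpoint_integral(lambda y: y * 2 * y, 0.0, 1.0)

# Example 2: density lam * exp(-lam * x) on (0, inf), truncated at 100,
# where the neglected tail is of order exp(-50).
lam = 0.5
mean_example2 = midpoint_integral(lambda x: x * lam * math.exp(-lam * x), 0.0, 100.0)
# Same mean from the tail formula, using P(X > x) = exp(-lam * x).
tail_example2 = midpoint_integral(lambda x: math.exp(-lam * x), 0.0, 100.0)

# Example 3: the truncated integral of |z| f(z) keeps growing with T.
def cauchy_abs_moment(T):
    return midpoint_integral(lambda z: abs(z) / (math.pi * (1 + z * z)), -T, T)

print(mean_example1, mean_example2, tail_example2)
print(cauchy_abs_moment(10.0), cauchy_abs_moment(1000.0))
```

The last line illustrates divergence only informally: the truncated integral equals ln(1 + T^2)/pi, which increases without bound as T grows.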

Note that, in the above definition of expectation for the continuous
case as well as in the corresponding definition for the discrete case, the
expected value of a random variable X is analogous to the centroid (or
center of gravity) of a unit mass spread out on the line according to the
probability distribution of X. In the discrete case, if one has masses
$p(x_1), p(x_2), \ldots$ at the points $x_1, x_2, \ldots$ on the line, then the centroid
of that distribution of masses is at $E(X) = \sum_k x_k\,p(x_k)$. Similarly, if a
unit mass is distributed continuously over the real line according to the
density function f, then the centroid of the distribution of mass is at
$\int x\,f(x)\,dx$. The following theorem becomes apparent from this interpretation
of E(X).

Theorem 5-3. If a random variable X having finite expectation has
a probability or density function that is symmetric about a point c, then
E(X) = c.

A second measure of the center of a distribution is the "median."
Roughly speaking, the median of a distribution is a value such that half of
the probability lies to the left of the value and half to the right with
an appropriate adjustment for the discrete case.

Definition. The median of a random variable X (or of the distribution
of X) is defined to be any value m such that $P(X \ge m) \ge 1/2$ and
$P(X \le m) \ge 1/2$.

For example, if X has probability function $p(x_k) = 1/n$ where
$x_1, x_2, \ldots, x_n$ are n distinct real numbers such that $x_1 < x_2 < \cdots < x_n$,
then the median of X is $x_{(n+1)/2}$ if n is odd, and any number between
$x_{n/2}$ and $x_{(n+2)/2}$ if n is even. Ordinarily, in the latter case, one
defines the median to be the average of $x_{n/2}$ and $x_{(n+2)/2}$. Thus, if
n = 10, the median is usually defined as the average of $x_5$ and $x_6$.

If X has a continuous distribution, then there is at least one value
m such that $P(X \le m) = 1/2$. Since $P(X \le x) = F(x)$ where F is the
distribution function of X, the median of X is any solution of the equation
F(m) = 1/2.

Whether X has a continuous distribution or not, if the distribution
is symmetric about some point c, then the median of the distribution
is equal to c.
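For a discrete distribution, the definition of the median can be applied mechanically. The sketch below is illustrative: the function `medians` is a hypothetical helper that tests only the support points of a probability function given as a dictionary, so for an even number of equally likely values it reports the two middle values rather than the whole interval between them.

```python
from fractions import Fraction

def medians(pmf):
    """Support points m with P(X <= m) >= 1/2 and P(X >= m) >= 1/2."""
    half = Fraction(1, 2)
    found = []
    for m in sorted(pmf):
        left = sum(p for x, p in pmf.items() if x <= m)   # P(X <= m)
        right = sum(p for x, p in pmf.items() if x >= m)  # P(X >= m)
        if left >= half and right >= half:
            found.append(m)
    return found

# Probability function from the hatcheck problem.
hatcheck = {0: Fraction(3, 8), 1: Fraction(1, 3), 2: Fraction(1, 4),
            3: Fraction(0), 4: Fraction(1, 24)}
# Equally likely values, n odd and n even.
uniform_odd = {x: Fraction(1, 3) for x in (1, 2, 3)}
uniform_even = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}

print(medians(hatcheck), medians(uniform_odd), medians(uniform_even))
```

In the even case every number between the two reported values also satisfies the definition, which is why the midpoint is the conventional choice.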

Exercises.

1. Let X be the number of hats returned correctly in the hat-check
girl problem. (See page 29.) Show that the median of X is 1, and
E(X) = 1. Verify that the geometric interpretation of E(X) given on page
44 holds in this case.

2. Show that if Y has density function f(y) = (2 - y)/2 for 0 < y < 2,
then E(Y) = 2/3, and the median of Y is $2 - \sqrt{2}$.

3. Show that if X has density function f(x) = 1/(b - a) for
a < x < b, then E(X) = (a + b)/2, and the median of X has the same value.

The linearity properties of expectation specified in Theorem 5-1 hold
whether the random variables are discrete or not. The theorem that
corresponds to Theorem 5-2 in the continuous case is:

Theorem 5-4. If X is a continuous random variable having density
function f and if Y = g(X) is a random variable such that E[g(X)] exists,
then

$$E(Y) = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx.$$

Exercises.

1. Let X be the random variable in the spinner problem, and let
$Y = X^2$. Apply the theorem above to show that E(Y) = 1/3.

2. Show that the density function of Y in the preceding problem
is $f(y) = 1/(2\sqrt{y})$ for 0 < y < 1. Compute E(Y) from the definition
and thus verify the result in Problem 1.

Definition. The variance of a random variable X, denoted by Var(X)
or $\sigma_X^2$, is defined by $E(X - \mu)^2$ where $\mu = E(X)$, provided that this
expectation exists. The standard deviation of X, denoted by $\sigma_X$, is
defined as the positive square root of the variance.

The variance and standard deviation are measures of the "spread" of
the distribution of X. Another measure of spread is the mean absolute
deviation, defined as $E|X - \mu|$. The variance and standard
deviation are more widely used because they are more tractable,
for reasons that will become apparent later.

Examples.

1. If the distribution of X is entirely concentrated at a single
point c, so that P(X = c) = 1, then E(X) = c and Var(X) = 0.

2. Let X be the number of heads in five tosses of a fair coin.
Then X has probability function $p(x) = \binom{5}{x}(1/2)^x(1/2)^{5-x} = \binom{5}{x}(1/2)^5$.
The values of p are as follows:

    x        0       1       2       3       4       5

    p(x)    1/32    5/32    5/16    5/16    5/32    1/32

By the symmetry of p around x = 5/2, it follows from Theorem 5-3 that
E(X) = 5/2. The value of Var(X) can be computed directly from the definition:

$$
\begin{aligned}
\operatorname{Var}(X) &= E(X - \mu)^2 = \sum_x (x - 5/2)^2\,p(x) \\
  &= (-5/2)^2(1/32) + (-3/2)^2(5/32) + (-1/2)^2(5/16) + (1/2)^2(5/16) \\
  &\quad + (3/2)^2(5/32) + (5/2)^2(1/32) = 5/4.
\end{aligned}
$$

Thus, the standard deviation of X is $\sqrt{5}/2 \approx 1.12$.


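The following theorem often facilitates the calculation of variance.

Theorem 5-5. If X is a random variable for which $E(X) = \mu$ and
$E(X^2) < \infty$ and if a and b are any constants, then

(a) $\operatorname{Var}(X) = E(X^2) - \mu^2$.

(b) $\operatorname{Var}(X + b) = \operatorname{Var}(X)$.

(c) $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$, and $\sigma_{aX} = |a|\,\sigma_X$.

(d) $\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$.

Proof: $\operatorname{Var}(X) = E(X - \mu)^2 = E(X^2 - 2\mu X + \mu^2)$. Using the linearity
properties of expectation (Theorem 5-1) gives

$$\operatorname{Var}(X) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2.$$

Parts (b) and (c) follow from (d):

$$\operatorname{Var}(aX + b) = E(aX + b - a\mu - b)^2 = E\,a^2(X - \mu)^2 = a^2 E(X - \mu)^2 = a^2 \operatorname{Var}(X).$$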
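Parts (a) and (d) of Theorem 5-5 can be checked exactly on the five-toss distribution used earlier. This is a sketch with illustrative names; the constants a = 3 and b = 7 are arbitrary.

```python
from fractions import Fraction
from math import comb

# Probability function of the number of heads in five tosses of a fair coin.
p = {x: Fraction(comb(5, x), 2 ** 5) for x in range(6)}

mu = sum(x * px for x, px in p.items())

# Variance from the definition E(X - mu)^2.
var_definition = sum((x - mu) ** 2 * px for x, px in p.items())

# Variance from Theorem 5-5(a): E(X^2) - mu^2.
second_moment = sum(x * x * px for x, px in p.items())
var_shortcut = second_moment - mu ** 2

# Theorem 5-5(d): variance of W = aX + b, computed from the definition.
a, b = 3, 7  # arbitrary constants for illustration
mu_w = sum((a * x + b) * px for x, px in p.items())
var_w = sum((a * x + b - mu_w) ** 2 * px for x, px in p.items())

print(mu, second_moment, var_definition, var_shortcut, var_w)
```

The shift b drops out of `var_w` entirely, and the scale a enters only through its square.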

Examples.

1. Applying part (a) of the above theorem, one could have computed
Var(X) in the previous example by first computing $E(X^2)$:

$$E(X^2) = \sum_x x^2 p(x) = 0(1/32) + 1(5/32) + 4(5/16) + 9(5/16) + 16(5/32) + 25(1/32) = 15/2.$$

Hence, $\operatorname{Var}(X) = E(X^2) - \mu^2 = 15/2 - (5/2)^2 = 5/4$.

2. Let X be a discrete random variable having probability function
$p(x_k) = 1/n$ where $x_1, x_2, \ldots, x_n$ are n distinct real numbers. Then
since $E(X) = \bar{x} = \sum_k x_k/n$, $\operatorname{Var}(X) = \sum_k (x_k - \bar{x})^2/n$. Applying Theorem 5-5(a),
one can compute Var(X) in this case using the formula
$\operatorname{Var}(X) = (\sum_k x_k^2/n) - \bar{x}^2$.

Theorem 5-6.

(a) If Y is a nonnegative random variable, then $P(Y \ge c) \le E(Y)/c$
for all c > 0.

(b) (Chebyshev's Inequality) For any random variable X having
finite variance $\sigma^2$,

    $P(|X - \mu| \ge \epsilon) \le \sigma^2/\epsilon^2$ for all $\epsilon > 0$.

In particular, $P(|X - \mu| \ge k\sigma) \le 1/k^2$ for all k > 0.

Proof: (a) It follows immediately from the geometric interpretation
of E(Y) that $E(Y) \ge c\,P(Y \ge c)$ for all c > 0. See the figure below.

(b) $P(|X - \mu| \ge \epsilon) = P((X - \mu)^2 \ge \epsilon^2) \le E(X - \mu)^2/\epsilon^2 = \sigma^2/\epsilon^2$.

It follows from part (b) of the theorem that $P(|X - \mu| < k\sigma) \ge 1 - 1/k^2$
for all k > 0. The table on the next page compares these "Chebyshev bounds"
on the probabilities $P(|X - \mu| < k\sigma)$ with the actual probabilities for two
distributions:

(A) The distribution of the number of heads in five tosses of a fair
coin. (See Example 2, page 48.)

(B) The continuous distribution having the "bell-shaped" density function
$f(x) = (2\pi)^{-1/2} e^{-x^2/2}$, which has mean 0 and variance 1.

Table 1

A COMPARISON OF CHEBYSHEV BOUNDS WITH ACTUAL PROBABILITIES $P(|X - \mu| < k\sigma)$

    k    Chebyshev bound    Actual (A)         Actual (B)

    1    $\ge$ 0            5/8 = 0.625        0.683
    2    $\ge$ 3/4          15/16 = 0.938      0.954
    3    $\ge$ 8/9          1                  0.997
    4    $\ge$ 15/16        1                  1.000
    5    $\ge$ 24/25        1                  1.000
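The entries of Table 1 can be recomputed with a short script. This is a sketch under two stated assumptions: distribution (A) is the five-toss distribution with mean 5/2 and standard deviation sqrt(5)/2, and for distribution (B) the standard identity P(|X| < k) = erf(k / sqrt(2)) for the unit normal is used. The function names are illustrative.

```python
from fractions import Fraction
from math import comb, erf, sqrt

# Distribution (A): number of heads in five tosses of a fair coin.
p = {x: Fraction(comb(5, x), 32) for x in range(6)}
mu = Fraction(5, 2)
sigma = sqrt(5) / 2

def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on P(|X - mu| < k sigma)."""
    return 1 - Fraction(1, k * k)

def actual_A(k):
    """Exact P(|X - mu| < k sigma) for distribution (A)."""
    return sum(px for x, px in p.items() if abs(float(x - mu)) < k * sigma)

def actual_B(k):
    """P(|X| < k) for the unit normal, via the error function."""
    return erf(k / sqrt(2))

for k in range(1, 6):
    print(k, chebyshev_bound(k), actual_A(k), round(actual_B(k), 3))
```

In both cases the actual probabilities sit well above the bound, which is expected since the inequality must hold for every distribution with finite variance.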
Exercises.

1. The probability function of the random variable X in the
hat-check problem was:

    x        0      1      2      3      4

    p(x)    3/8    1/3    1/4     0     1/24

Here, E(X) = 1. Compute Var(X) directly from the definition, and check
your result by computing Var(X) using the formula

    $\operatorname{Var}(X) = E(X^2) - E^2(X)$.

2. Five balls are chosen at random from an urn containing 9 balls of
which 3 are white. Let X be the number of white balls in the sample.
Find Var(X) if the sampling is done (a) with replacement, (b) without
replacement. (See Exercise 1, page 36 and Exercise 2, page 40.)

Ans. (a) 10/9, (b) 5/9.

3. Let X be the random variable in the spinner problem, so that X
has density function f(x) = 1 for 0 < x < 1. (a) Show that Var(X) = 1/12.
(b) Show that $P(|X - E(X)| < 2\sigma) = 1$ and $P(|X - E(X)| < \sigma) = 1/\sqrt{3} = 0.577$.

4. Show that if X has mean $\mu$ and variance $\sigma^2$, then $Z = (X - \mu)/\sigma$
has mean 0 and variance 1.

5. Suppose X has density function f(x) = (2 - x)/2 for 0 < x < 2.

(a) Sketch the density function of X and find P(0 < X < 1).

(b) Find and sketch the distribution function of X.

(c) Find E(X) and Var(X).

(d) Find $P(|X - \mu| > 2\sigma)$.

Ans. (a) 3/4, (b) F(x) = 0 for x < 0, x(4 - x)/4 for $0 \le x \le 2$, 1 for
x > 2, (c) 2/3, 2/9, (d) 0.04.

6. If X is the sum of two numbers chosen independently and at random
between 0 and 1, then X has density f(x) = 1 - |1 - x| for 0 < x < 2.
Find (a) P(1/2 < X < 3/2), (b) E(X), (c) Var(X), (d) $P(|X - \mu| > 2\sigma)$.

Ans. (a) 3/4, (b) 1, (c) 1/6, (d) 0.03.

7. Show that, if X is a random variable such that Var(X) exists,
then among all real numbers c, $E(X - c)^2$ is minimized by c = E(X).

SECTION VI. - SOME SPECIAL DISTRIBUTIONS

References:

Paul L. Meyer, Introductory Probability and Statistical
Applications, 2nd Edition, Addison-Wesley, 1970, Chapters 8-9.

Seymour Lipschutz, Theory and Problems of Probability, Schaum's
Outline Series, McGraw-Hill, New York, 1968, Chapter 6.

The table on the next page gives the probability (or density)
functions, means, and variances of some frequently encountered
distributions. Examples of random variables that have these distributions
are given below.

Bernoulli. Any random variable that takes on only the two values
1 and 0 with probabilities p and q = 1 - p has a Bernoulli
distribution with parameter p.

Binomial. The number of successes in n Bernoulli trials with
probability p of success on each trial has a binomial distribution
with parameters n and p. (See page 26.)

Hypergeometric. If X is the number of defectives in a sample of
size n taken without replacement from a lot of N items of which Np
are defective, then X has a hypergeometric distribution. (See Theorem 2-2.)

The values of the probability function of the hypergeometric
distribution for certain values of n, p, and N are given in Table 2. In
each case, the values of n and p are chosen so that the expected number
of defectives is E(X) = np = 2. Note that for fixed values of n and p
the distribution becomes more variable as the population size N increases.
Since the variance of the hypergeometric distribution is
$\operatorname{Var}(X) = npq\,\frac{N-n}{N-1}$, as $N \to \infty$ the variance tends to npq, the
variance of a binomial distribution with parameters n and p.

If  the  sample  of  size  n Is  taken  with  replacement  Instead  of  without 
replacement,  then  X has  a binomial  distribution  with  parameters  n and  p. 
As  Intuition  would  suggest.  If  the  population  size  Is  much  larger  than  the 
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Tablc  1 

A SHORT  TABLE  OF  DISTRIBUTIONS 


Distribution  and 
range  of  parameters 

Probability  or 
density  function 

Mean 

E(X) 

Variance 

Var(X) 

Bernoulli  (p) 

0 < p < 1 

X 1-x  ^ , 

p q , X - 0,1 

P 

pq 

Binomial  (n^p) 

0 < p < 1 

n * ly2y*** 

X ■ 0,1 n 

np 

npq 

Hypergeome tr Ic 
N - 1,2,... 
n ■ 1,2, . . . ,N 
p - 0,  1/N,...,(N-1)/N,1 

Np  Nq 

X n-x  „ _ , 

np 

,N-n. 

npq(^) 

M * x*o>i|...^n 

Poisson  (X) 
X > 0 

pX“  0^1,2,*.. 

X 

X 

Geometric 

0 < p < 1 

pq*~^,  X - 1,2,... 

1/p 

Negative 

Binomial 

0 < p < 1 

r - 1,2,, . . 

,x-l.  r x-r  . - 

(r.i>P  q , X - r,  r + 1.... 

r/p 

r<l/p^ 

Uniform  (a»b) 
“•<a<b<~ 

a + b 

(b  - a)^ 

b-a 

2 

12 

2 

Normal  (\kt  a ) 
a > 0 

1 -(X  - u)^/2a^ 

2 

e 

a *'  2tt 

cr 

Negative 
Exponential  (X) 

X > 0 

Xe~^*,  X > 0 

1/X 

l/x^ 

Gamma  (r,X) 
r > 0,  X > 0 

-XX  Q 

r(r)  * , X > 0 

r/X 

r/X^ 

Chi-square  (n) 

1 n/2  -1  -x/2 

n 

2n 

n • ly2^«».  - 

See  Gamma  (*22) 

r»/9  ^ ® fX^U 

2"'‘'r(n/2) 

Cauchy  (jt.X) 
X > 0 

go 

n(J?  +(x-(»)^} 

Laplace  (j»,X) 
X > 0 

-1_  -|x-ul/X 
2X  ® 

2X^ 

Pareto  (or,c) 

or  fCy.oh-1  ^ Qfc 

I <x>  ' * > spr 

if  a > 1 

“5^ 

c > 0,  Of  > 0 

(a-l)^(tr-2) 
Table 2

A COMPARISON OF HYPERGEOMETRIC, BINOMIAL,
AND POISSON PROBABILITIES

Sample                                  Hypergeometric
 size      p      x     N=5    N=10    N=20    N=50   N=100   Binomial  Poisson
--------------------------------------------------------------------------------
    5    0.4      0       -    .024    .051    .067    .073    .078    .135
                  1       -    .238    .255    .259    .259    .259    .271
                  2     1.0    .476    .397    .364    .354    .346    .271
                  3       -    .238    .238    .234    .232    .230    .180
                  4       -    .024    .054    .069    .073    .077    .090
                  5       -       -    .004    .007    .009    .010    .036

   10    0.2      0               -    .043    .083    .095    .107    .135
                  1               -    .248    .266    .268    .268    .271
                  2             1.0    .418    .337    .318    .302    .271
                  3               -    .248    .218    .209    .201    .180
                  4               -    .043    .078    .084    .088    .090
                  5               -       -    .016    .022    .026    .036
                  6               -       -    .002    .004    .006    .012
                  7               -       -    .000    .000    .001    .003
                  8               -       -    .000    .000    .000    .001
                  9               -       -    .000    .000    .000    .000
                 10               -       -    .000    .000    .000    .000

   20    0.1      0                       -    .067    .095    .122    .135
                  1                       -    .259    .268    .270    .271
                  2                     1.0    .364    .318    .285    .271
                  3                       -    .234    .209    .190    .180
                  4                       -    .069    .084    .090    .090
                  5                       -    .007    .022    .032    .036
                  6                       -       -    .004    .009    .012
                  7                       -       -    .000    .002    .003
                  8                       -       -    .000    .000    .001
               9-20                       -       -    .000    .000    .000

sample size, then the hypergeometric probabilities $P(X = k)$ differ
little from the corresponding binomial probabilities, and as $N \to \infty$
the hypergeometric probabilities tend to the binomial probabilities.

Table 2 compares the two sets of probabilities for N = 100 and for
three sample sizes n = 5, 10, and 20.
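
The convergence described above can be checked numerically. The following is a minimal sketch, not part of the original notes; the function names `hypergeom_pmf` and `binom_pmf` are illustrative choices, and the population is assumed to contain exactly Np "successes" as in the table of distributions.

```python
from math import comb

def hypergeom_pmf(x, N, n, p):
    """P(X = x) when n items are drawn without replacement from a
    population of N items, of which N*p are successes."""
    successes = round(N * p)          # assumes N*p is an integer
    failures = N - successes
    # Outcomes that need more successes or failures than exist are impossible.
    if x < 0 or x > successes or n - x < 0 or n - x > failures:
        return 0.0
    return comb(successes, x) * comb(failures, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """P(X = x) for the binomial distribution with parameters n and p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 5, 0.4
for N in (10, 20, 50, 100):
    row = [round(hypergeom_pmf(x, N, n, p), 3) for x in range(n + 1)]
    print(f"hypergeometric, N = {N:3d}:", row)
print("binomial             :", [round(binom_pmf(x, n, p), 3) for x in range(n + 1)])
```

Each printed row corresponds to one column of the n = 5 block of Table 2; the rows move toward the binomial row as N grows.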

Poisson. Suppose that events of a certain type (such as traffic
accidents, arrivals at a checkout counter, emissions of $\alpha$-particles from
a radioactive source, vacancies in the Supreme Court during a year, etc.)
are occurring randomly over time in such a way that certain assumptions
are satisfied (e.g., the events occur singly, and the numbers of
occurrences in disjoint time intervals are "independent"). Then the number
of occurrences X in a unit time interval can be assumed to have a
Poisson distribution with parameter $\lambda$, where $\lambda$ is the mean number of
occurrences in an interval of length one.¹ The number of occurrences in
a time interval of length t has a Poisson distribution with parameter $\lambda t$.

The Poisson distribution also arises as a limit of binomial
distributions as $n \to \infty$ and $p \to 0$ in such a way that $np \to \lambda$. Table 2 gives
the Poisson probabilities for $\lambda = 2$. Compare these probabilities with
the binomial probabilities for (a) n = 5, p = 0.4; (b) n = 10, p = 0.2;
and (c) n = 20, p = 0.1. In all three cases, np = 2. Note that as
n increases, the differences between the binomial and Poisson probabilities
become smaller.
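
As a rough numerical illustration of this limit (a sketch added for clarity, with illustrative function names that do not appear in the notes), one can hold np = 2 fixed and look at the largest gap between the binomial and Poisson probabilities as n increases:

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """Poisson probability of exactly x occurrences when the mean is lam."""
    return exp(-lam) * lam**x / factorial(x)

def max_abs_diff(n, lam=2.0):
    """Largest |binomial - Poisson| over x = 0, ..., n with p = lam / n."""
    p = lam / n
    return max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(n + 1))

for n in (5, 10, 20, 100):
    print(n, round(max_abs_diff(n), 4))
```

The printed gaps shrink as n grows, in line with the comparison of cases (a), (b), and (c) in Table 2.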

Geometric and Negative Binomial. These distributions occur in
considering the number of Bernoulli trials required until a certain number
of successes occur. If X is the number of trials required until r
successes occur, then X has a negative binomial distribution with
parameters r and p, where p is the probability of a success on each
trial. If X is the waiting time for the first success (i.e., the
special case where r = 1), then X has a geometric distribution. For
example, if two fair dice are tossed again and again until a total of
seven occurs for the first time, then the number of the trial on which
seven occurs has a geometric distribution with parameter p = 1/6, and
the expected number of trials is 6.
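
The dice example can be verified directly. The sketch below (an illustration, not from the notes) enumerates the 36 outcomes of two dice to obtain p, then approximates E(X) by a long partial sum of $x\,p\,q^{x-1}$; the cutoff of 500 terms is an arbitrary choice that leaves a negligible tail.

```python
from fractions import Fraction
from itertools import product

# Probability that two fair dice total seven, by enumerating all 36 outcomes.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(sum(1 for a, b in outcomes if a + b == 7), len(outcomes))

def geometric_pmf(x, p):
    """P(first success occurs on trial x) = p * q**(x - 1), x = 1, 2, ..."""
    return p * (1 - p) ** (x - 1)

# Partial sum of x * P(X = x); the neglected tail beyond x = 499 is tiny.
approx_mean = float(sum(x * geometric_pmf(x, p) for x in range(1, 500)))
print(p, approx_mean)
```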


¹Meyer, op. cit., pp. 166-168.
Uniform. A random variable U has a uniform distribution on an interval
(a,b) if the probability that U takes on values in any subinterval (c,d) of
(a,b) is proportional to the length of the subinterval, and the probability
that U takes on values outside the interval (a,b) is zero. For example, the
random variable in the spinner problem has a uniform distribution on (0,1).

Normal. This distribution is the most frequently used of all
distributions in statistical applications for two reasons: (a) many statistical
calculations are greatly simplified if the random variables involved are assumed
to have normal distributions, (b) the normal distribution provides a reasonable
approximation for distributions of repeated measurements of many physical
phenomena: cranial lengths, ballistic measurements (coordinates of deviations
from the target), logarithms of incomes, heights, IQ scores, sums or averages
of several test scores, etc. The normal distribution is also the limiting
distribution of many distributions (binomial, hypergeometric, Poisson, negative
binomial, and distributions of sums and averages of random variables that
satisfy certain properties).

A random variable Z is said to have a standard normal distribution
if Z has density function $\varphi(z) = (2\pi)^{-1/2} e^{-z^2/2}$ for $-\infty < z < \infty$. This
bell-shaped density function is symmetric about zero. It is easily verified
that E(Z) = 0 and Var(Z) = 1. The distribution function of Z, commonly
denoted by $\Phi$ in the statistical literature, is tabulated in Table 3. For
example, $P(Z < 2) = \Phi(2) = 0.9772$. The values of $\Phi(z)$ for negative values
of z can be computed using the formula $\Phi(z) = 1 - \Phi(-z)$, which follows
from the symmetry of the distribution about zero. For example,
$P(Z < -2) = 1 - \Phi(2) = 0.0228$. Note that P(-2 < Z < 2) is approximately
0.95.
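
Where a table is not at hand, $\Phi$ can be evaluated from the error function via the identity $\Phi(z) = \tfrac12[1 + \mathrm{erf}(z/\sqrt2)]$. The following sketch (an addition for illustration; the name `Phi` is simply a convenient label) reproduces the three values quoted above.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

print(round(Phi(2), 4))              # P(Z < 2)
print(round(1 - Phi(2), 4))          # P(Z < -2), using the symmetry formula
print(round(Phi(-2), 4))             # the same probability computed directly
print(round(Phi(2) - Phi(-2), 4))    # P(-2 < Z < 2)
```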

If X is a random variable such that $Z = (X - \mu)/\sigma$ has a standard
normal distribution, then X is said to have a normal distribution with
parameters $\mu$ and $\sigma^2$, which is often abbreviated to $X \sim N(\mu, \sigma^2)$.

Exercise. Verify that (a) the random variable Z having a standard
normal distribution has mean 0 and variance 1, (b) P(-1 < Z < 1) = 0.68,
(c) $X = \mu + \sigma Z$ has mean $\mu$ and variance $\sigma^2$, (d) X has density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}.$$
Table 3

CUMULATIVE NORMAL DISTRIBUTION

$$\Phi(x) = \int_{-\infty}^{x} (2\pi)^{-1/2}\, e^{-t^2/2}\, dt$$

  x     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
--------------------------------------------------------------------------
  .0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
  .1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
  .2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
  .3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
  .4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
  .5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
  .6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
  .7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
  .8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
  .9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
 1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
 1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
 1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
 1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
 2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
 2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
 2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
 3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998

 x                   1.282   1.645   1.960   2.326   2.576   3.090   3.291   3.891    4.417
 $\Phi(x)$             .90     .95    .975     .99    .995    .999   .9995  .99995  .999995
 $2[1 - \Phi(x)]$      .20     .10     .05     .02     .01    .002    .001   .0001   .00001

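If $X \sim N(\mu, \sigma^2)$, then one can use a table of the standard normal
distribution function to compute any probability of the form P(a < X < b):

$$P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right) = \Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right).$$

For example, if X ~ N(28,4), then

$$P(25 < X < 27) = \Phi\left(\frac{27-28}{2}\right) - \Phi\left(\frac{25-28}{2}\right) = \Phi(-0.5) - \Phi(-1.5) = 0.31 - 0.07 = 0.24.$$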
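
The standardization step can be written as a small helper. This is a minimal sketch under the assumption that $\Phi$ is computed from the error function rather than read from Table 3; `normal_interval_prob` is a hypothetical name introduced here, and its `sigma` argument is the standard deviation (2 in the example, since the variance is 4).

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_interval_prob(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma**2), obtained by standardizing both endpoints."""
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

print(round(normal_interval_prob(25, 27, mu=28, sigma=2), 2))
```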


The normal distribution frequently occurs as the limiting distribution
of sums or averages of a large number of random variables. In the simplest
case, consider a sequence of Bernoulli trials with probability p of success
on each trial. Let $X_i$ be 1 or 0 according as the i-th trial is a success
or not, and let $S_n = X_1 + X_2 + \dots + X_n$. Then $S_n$ is the number of successes
in the first n trials, which has a binomial distribution with parameters n
and p, so that $E(S_n) = np$ and $\mathrm{Var}(S_n) = npq$. For large values of n,
the distribution of $S_n$ is approximately normal with mean $\mu = np$ and
variance $\sigma^2 = npq$ in the sense that

$$P\left(a \le \frac{S_n - np}{\sqrt{npq}} \le b\right) \approx \Phi(b) - \Phi(a)$$

for any real numbers a < b, and as $n \to \infty$ the probability on the left tends
to the limit on the right. This is called the DeMoivre-Laplace Central Limit
Theorem. For a proof, see W. Feller, An Introduction to Probability Theory
and Its Applications, Volume I, 3rd Edition, John Wiley, 1968, pp. 182-186.
(A more general result on the limiting distribution of sums of outcomes of
independent trials is contained in Section VIII.)

It follows that for any integer k,

$$P(S_n \le k) = P\left(\frac{S_n - np}{\sqrt{npq}} \le \frac{k - np}{\sqrt{npq}}\right) \approx \Phi\left(\frac{k - np}{\sqrt{npq}}\right).$$

This approximation is usually improved by first replacing $P(S_n \le k)$ by the
equivalent quantity $P(S_n \le k + 1/2)$ and then proceeding as before to obtain

$$P(S_n \le k) = P(S_n \le k + 1/2) \approx \Phi\left(\frac{k + 1/2 - np}{\sqrt{npq}}\right).$$

This so-called "continuity correction" is motivated by the fact that a step
function (namely, the distribution function of $S_n$) is being approximated
by a continuous function (the distribution function of a normal distribution)
that tends to pass through the "midpoints" of the steps.

Table 4 compares the two normal approximations for the
case where n = 12 and p = 1/4. The entries in the column headed "Poisson
approximation" are the probabilities $P(X \le k)$ where X has a Poisson
distribution with parameter $\lambda = np = 3$.
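
The four columns of such a comparison can be recomputed with a few lines of code. The sketch below is an illustration rather than the author's procedure; the function names are assumptions, and $\Phi$ is evaluated with the error function, so the last digit may differ slightly from values obtained by table lookup.

```python
from math import comb, erf, exp, factorial, sqrt

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 12, 0.25
q = 1 - p
mu, sd = n * p, sqrt(n * p * q)    # mean 3, standard deviation 1.5
lam = n * p                        # Poisson parameter matching the binomial mean

def binom_cdf(k):
    """Exact P(S_n <= k)."""
    return sum(comb(n, x) * p**x * q**(n - x) for x in range(k + 1))

def normal_plain(k):
    """Normal approximation without continuity correction."""
    return Phi((k - mu) / sd)

def normal_corrected(k):
    """Normal approximation with continuity correction."""
    return Phi((k + 0.5 - mu) / sd)

def poisson_cdf(k):
    """P(X <= k) for X Poisson with parameter lam."""
    return sum(exp(-lam) * lam**x / factorial(x) for x in range(k + 1))

for k in range(n + 1):
    values = (binom_cdf(k), normal_plain(k), normal_corrected(k), poisson_cdf(k))
    print(f"{k:2d}", "  ".join(f"{v:.4f}" for v in values))
```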

Example. If 55 percent of the voters of a large city are in favor of
a given proposal, what is the probability that a random sample of 100 voters
would not show a majority in favor?

Let X be the number in the sample favoring the proposal. If the
sampling is done with replacement, then X has a binomial distribution with
parameters n = 100 and p = 0.55, so that E(X) = np = 55 and
$\sigma = \sqrt{npq} = 4.98$. Hence

$$P(X \le 50) = P(X \le 50.5) = P\left(\frac{X - 55}{4.98} \le \frac{50.5 - 55}{4.98}\right) \approx \Phi(-0.90) = 0.18.$$
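
As a check on this example, the normal approximation can be set beside the exact binomial sum. This sketch is an added illustration; the variable names are arbitrary, and $\Phi$ is again computed from the error function.

```python
from math import comb, erf, sqrt

def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p = 100, 0.55
mu = n * p
sd = sqrt(n * p * (1 - p))

# Normal approximation with continuity correction: P(X <= 50) = P(X <= 50.5).
approx = Phi((50.5 - mu) / sd)

# Exact binomial probability, summed term by term for comparison.
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(51))

print(round(approx, 3), round(exact, 3))
```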

Exercises. 1. A man claims to be able to predict whether a fair coin
will result in heads before it is flipped. To test his contention you toss
a fair coin 100 times and record the number of times that he predicts the
result correctly. What is the approximate probability that he will predict the
result correctly 60 or more times if his predictions are mere guesses? Ans. 0.03.

2. Suppose that the lifetimes of components of a certain type have a
$N(\mu, \sigma^2)$ distribution with $\mu = 1000$ hours and $\sigma = 100$ hours. What is the
approximate probability that, among 45 components chosen at random from
components of this type, 10 or more will last less than 900 hours? [To make the
arithmetic easy, assume that $\Phi(-1) = 1/6$.] Ans. 0.21.
Table 4

A COMPARISON OF THE NORMAL AND POISSON APPROXIMATIONS TO THE
BINOMIAL PROBABILITIES $P(S_n \le k)$ FOR THE CASE n = 12, p = 0.25

                             Normal approximation
                         ----------------------------
                           Without          With
                          continuity     continuity       Poisson
   k    $P(S_n \le k)$    correction     correction    approximation
---------------------------------------------------------------------
   0        .0317           .0228          .0478           .0498
   1        .1584           .0913          .1587           .1991
   2        .3907           .2525          .3695           .4232
   3        .6488           .5000          .6305           .6472
   4        .8424           .7475          .8413           .8153
   5        .9456           .9087          .9522           .9161
   6        .9857           .9772          .9902           .9665
   7        .9972           .9962          .9987           .9881
   8        .9996           .9996          .9999           .9962
   9       1.0000          1.0000         1.0000           .9989
  10       1.0000          1.0000         1.0000           .9997
  11       1.0000          1.0000         1.0000           .9999
  12       1.0000          1.0000         1.0000          1.0000

The Lognormal Distribution. A random variable X is said to have a
lognormal distribution if Y = log X has a $N(\mu, \sigma^2)$ distribution. This is
equivalent to saying that X has a lognormal distribution if there is a
normally distributed random variable Y such that X has the same
distribution as $e^Y$. Since X has distribution function

$$F(x) = P(X \le x) = P(e^Y \le x) = P(Y \le \log x) = \int_{-\infty}^{\log x} f_Y(y)\, dy, \qquad x > 0,$$

X has density function

$$f(x) = F'(x) = f_Y(\log x)/x = \frac{1}{\sigma x \sqrt{2\pi}} \exp\{-(\log x - \mu)^2 / 2\sigma^2\} \quad \text{for } x > 0.$$

Using the fact that $E(e^{tY}) = \exp\{\mu t + \sigma^2 t^2/2\}$ for all values of t (see
Meyer, op. cit., p. 210), one can show that

$$E(X) = E(e^Y) = e^{\mu + \sigma^2/2},$$

$$\mathrm{Var}(X) = e^{2\mu + \sigma^2}\,(e^{\sigma^2} - 1).$$

The median of the distribution of X is $e^{\mu}$. [In general, if Y is a
random variable having median m, and if X is an increasing (or decreasing)
function of Y, say X = h(Y), then the median of X is h(m).]
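
The mean and variance of the lognormal distribution follow in two lines from the quoted formula for $E(e^{tY})$. The steps are sketched below for a normally distributed Y with parameters $\mu$ and $\sigma^2$; only the formula for $E(e^{tY})$ is used, evaluated at t = 1 and t = 2.

```latex
\begin{align*}
E(X)   &= E(e^{Y})  = \exp\{\mu + \sigma^2/2\}            && (t = 1),\\
E(X^2) &= E(e^{2Y}) = \exp\{2\mu + 2\sigma^2\}            && (t = 2),\\
\operatorname{Var}(X) &= E(X^2) - [E(X)]^2
        = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2}
        = e^{2\mu + \sigma^2}\,\bigl(e^{\sigma^2} - 1\bigr).
\end{align*}
```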

The Negative Exponential, Gamma, and Chi-square Distributions. Suppose
that events of a certain type are occurring over time in such a way that $X_t$,
the number of events up to time t, has a Poisson distribution with
parameter $\lambda t$ for all values of t. Consider the waiting time T for exactly
r events to occur. Then the distribution function of T is

$$F(t) = P(T \le t) = P(X_t \ge r) = 1 - \sum_{n=0}^{r-1} e^{-\lambda t}(\lambda t)^n / n! \quad \text{for } t > 0.$$

Therefore, the density function of T is

$$f(t) = F'(t) = -\sum_{n=1}^{r-1} e^{-\lambda t}(\lambda t)^{n-1}\lambda/(n-1)! + \sum_{n=0}^{r-1} \lambda(\lambda t)^n e^{-\lambda t}/n! \quad \text{for } t > 0.$$

Since the terms in the first sum are the negatives of the first r-1 terms
in the second sum, the density reduces to

$$f(t) = \lambda(\lambda t)^{r-1} e^{-\lambda t}/(r-1)! \quad \text{for } t > 0.$$

A random variable having this density is said to have a gamma distribution
with parameters r and $\lambda$. If r = 1, then

$$f(t) = \lambda e^{-\lambda t} \quad \text{for } t > 0.$$

A random variable having this density is said to have a negative exponential
distribution with parameter $\lambda$.

In general, if events are occurring randomly over time in such a way
that the number of occurrences up to time t has a Poisson distribution
with parameter $\lambda t$, then not only is it the case that the waiting time for
the first occurrence has a negative exponential distribution with parameter
$\lambda$, but also the waiting time between any two successive occurrences has a
negative exponential distribution with the same parameter. Conversely, if
the waiting times between successive occurrences are "independent" (see
Section VII) and if these waiting times have a negative exponential
distribution with parameter $\lambda$, then the number of occurrences in any fixed time
interval of length t has a Poisson distribution with parameter $\lambda t$. Thus,
to generate a sequence of occurrences for which the Poisson model would
apply, it suffices to generate random variables having negative exponential
distributions. (See Exercise 2 below.)
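
The generation scheme just described might be sketched as follows. This is an illustrative simulation, not part of the notes: waiting times are produced as $-\log U/\lambda$ with U uniform (the transformation in Exercise 2), and the seed, the parameter values, and the function name `count_occurrences` are arbitrary choices made for the example.

```python
import random
from math import log

def count_occurrences(lam, t, rng):
    """Number of occurrences in (0, t] when successive waiting times are
    independent negative exponential variables with parameter lam."""
    elapsed, count = 0.0, 0
    while True:
        u = 1.0 - rng.random()        # uniform on (0, 1], so log(u) is defined
        elapsed += -log(u) / lam      # one negative exponential waiting time
        if elapsed > t:
            return count
        count += 1

rng = random.Random(1974)             # fixed seed so the run is repeatable
lam, t, reps = 2.0, 3.0, 20000
counts = [count_occurrences(lam, t, rng) for _ in range(reps)]

mean = sum(counts) / reps
var = sum((c - mean) ** 2 for c in counts) / reps
print(mean, var)                      # both should be near lam * t = 6
```

For a Poisson distribution the mean and variance are equal, so seeing both sample values close to $\lambda t$ is a simple consistency check on the simulated counts.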

The parameter r in the gamma distribution was assumed to be a
positive integer above, but the gamma distribution can be defined for all positive
values of r by specifying the density as

$$f(t) = \lambda(\lambda t)^{r-1} e^{-\lambda t}/\Gamma(r) \quad \text{for } t > 0,$$

where $\Gamma$ is the gamma function defined by

$$\Gamma(r) = \int_0^{\infty} x^{r-1} e^{-x}\, dx.$$

It can be shown using integration by parts that

$$\Gamma(r) = (r-1)\,\Gamma(r-1),$$

and since $\Gamma(1) = \int_0^{\infty} e^{-x}\, dx = 1$, it follows that $\Gamma(r) = (r-1)!$ for all
positive integers r. It can be shown that $\Gamma(1/2) = \sqrt{\pi}$. Applying the
formula above, one can compute $\Gamma(3/2) = \sqrt{\pi}/2$, $\Gamma(5/2) = 3\sqrt{\pi}/4$, etc.
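
These identities are easy to confirm numerically with the gamma function in Python's standard `math` module. The sketch below is an added check; `recurrence_holds` is a hypothetical helper name.

```python
from math import factorial, gamma, isclose, pi, sqrt

def recurrence_holds(r):
    """Check Gamma(r) = (r - 1) * Gamma(r - 1) numerically for r > 1."""
    return isclose(gamma(r), (r - 1) * gamma(r - 1))

# Gamma(r) = (r - 1)! for positive integers r.
for r in range(1, 8):
    print(r, gamma(r), factorial(r - 1))

# Half-integer values obtained from Gamma(1/2) = sqrt(pi) and the recurrence.
print(gamma(0.5), sqrt(pi))
print(gamma(1.5), sqrt(pi) / 2)
print(gamma(2.5), 3 * sqrt(pi) / 4)
```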

The chi-square distribution with n degrees of freedom, which will
be discussed in Section VIII, is a special case of the gamma distribution
with parameters r = n/2 and $\lambda = 1/2$.

Exercises.

1. Show that, if X has a gamma distribution with parameters r
and $\lambda$, then $E(X) = r/\lambda$ and $\mathrm{Var}(X) = r/\lambda^2$.

2. Show that, if U has a Uniform(0,1) distribution, then T = -log U
has a negative exponential distribution with parameter $\lambda = 1$, and $V = T/\lambda$
has a negative exponential distribution with parameter $\lambda$.

The Cauchy Distribution. A random variable X is said to have a
Cauchy distribution with parameters $\mu$ and $\lambda > 0$ if X has density
function

$$f(x) = \frac{\lambda}{\pi\{\lambda^2 + (x - \mu)^2\}}, \qquad -\infty < x < \infty.$$

Since the Cauchy distribution has a bell-shaped density function that is
symmetric about $\mu$, the median of the distribution is $\mu$. The distribution
is of primary interest to statisticians as a source of counterexamples.
The expectation and variance of random variables having this distribution
do not exist, and certain averages of random variables having Cauchy
distributions have peculiar properties that will be discussed in Section VIII.
Laplace Distribution. A random variable X is said to have a Laplace
(or double exponential) distribution if it has density function

$$f(x) = \frac{1}{2\lambda}\, e^{-|x - \mu|/\lambda}, \qquad -\infty < x < \infty.$$

This tent-shaped distribution, which is symmetric about its mean $\mu$, is
primarily of theoretical interest, in part because of problems related
to estimating the parameter $\mu$. The case $\mu = 0$ arises in considering
differences of random variables that have negative exponential distributions.

Pareto Distribution. This distribution has density

$$f(x) = (\alpha/c)(c/x)^{\alpha+1} \quad \text{for } x > c.$$

This arises in considering distributions of characteristics which have been
"truncated" from below. For example, consider the distribution of incomes
among families that have incomes exceeding $20,000, or the distribution of
rain-gauge readings after storms that yield more than one inch of rain. The
parameter c above is the truncation point. Since $P(X > x) = (c/x)^{\alpha}$ for x > c
by Exercise 1 below, the parameter $\alpha$ indicates how rapidly the probability
in the "tail" of the distribution tends to zero.

Other Truncated Distributions. The distribution of any random variable
X can be truncated to the left (or right) at some point c by considering
the (conditional) distribution of X on the set {X > c} (or {X < c}). If
X has density function f(x), the conditional probability that $X \le x$ given
that X > c is

$$P(X \le x \mid X > c) = \frac{P(c < X \le x)}{P(X > c)} = \int_c^x f(x')\, dx' \Big/ [1 - F(c)] \quad \text{for } x > c.$$

This can be viewed as the "conditional" distribution function of X given
that X > c. Taking the derivative of $P(X \le x \mid X > c)$ with respect to x
yields the density function

$$g(x) = \begin{cases} 0 & \text{for } x \le c \\ f(x)/[1 - F(c)] & \text{for } x > c. \end{cases}$$

This density function, which is zero for $x \le c$ and has the same shape as
f(x) for x > c, is said to be the density function of the distribution of
X truncated to the left at x = c. Density functions of distributions
truncated to the right and probability functions of distributions truncated
to the left (or right) are defined similarly.

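Examples. 1. If $X \sim N(\mu, \sigma^2)$, the density of the distribution of X
truncated to the left at x = c is

$$g(x) = \frac{K}{\sigma}\, \varphi\left(\frac{x - \mu}{\sigma}\right) \quad \text{for } x > c,$$

where $K = 1/P(X > c) = [1 - \Phi(\frac{c-\mu}{\sigma})]^{-1} = [\Phi(\frac{\mu-c}{\sigma})]^{-1}$.
It can be shown that the expectation and variance of the truncated
distribution are $\mu + \lambda\sigma$ and $(1 - \lambda^2)\sigma^2 + \lambda\sigma(c - \mu)$, where
$\lambda = \varphi(\frac{c-\mu}{\sigma}) / [1 - \Phi(\frac{c-\mu}{\sigma})]$.
[See H. Cramér, Mathematical Methods of Statistics, Princeton University Press,
Princeton, 1946, p. 249. The function $\Phi(t)/\varphi(t)$ is tabulated in D. B. Owen,
Handbook of Statistical Tables, Addison-Wesley, Reading, Massachusetts, 1962,
pp. 1-10.]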

2. Suppose X has a negative exponential distribution with parameter $\lambda$.
Then

$$P(X > c) = \int_c^{\infty} \lambda e^{-\lambda x}\, dx = e^{-\lambda c} \quad \text{for all } c > 0.$$

The density of the distribution of X truncated to the left at x = c is

$$g(x) = \lambda e^{-\lambda x}/e^{-\lambda c} = \lambda e^{-\lambda(x - c)} \quad \text{for } x > c > 0.$$

In this case, the truncated density is the same as the original density
except that it has been shifted c units to the right. It follows that the
expectation and variance of the truncated distribution are $c + 1/\lambda$ and $1/\lambda^2$.

Exercises. 1. Show that, if X has the Pareto density with parameters
$\alpha$ and c, then

(a) $P(X > x) = (c/x)^{\alpha}$ for x > c,

(b) $E(X) = \alpha c/(\alpha - 1)$ for $\alpha > 1$.

Note that the expectation does not exist if $\alpha \le 1$.
2. Let T be the lifetime in hours of a component chosen at random
from electronic components of a certain type. Then the probability P(T > t)
can be interpreted as the proportion of components of that type that last
for more than t hours. In reliability theory, the function defined by

$$R(t) = P(T > t) \quad \text{for } t > 0$$

is called the reliability function for these components. Clearly, R(t) = 1 - F(t)
where F is the distribution function of T. For example, if T has a negative
exponential distribution with parameter $\lambda$, then $R(t) = e^{-\lambda t}$ for t > 0.

(a) Suppose n components are chosen at random from components having
reliability function R(t), and all of them begin operating at the same time.
Let N(t) be the number of these components that are still operating after
t hours. Show that $E[N(t)] = n R(t)$ and $P\{N(t) = n\} = [R(t)]^n$.

(b) Show that, if T has density function

$$f(t) = \lambda k t^{k-1} \exp(-\lambda t^k) \quad \text{for } t > 0,$$

where $\lambda$, k are positive parameters, then $R(t) = \exp(-\lambda t^k)$ for t > 0.
A random variable having this density is said to have a Weibull distribution
with parameters k and $\lambda$. Note that, if k = 1, this is the same as the
negative exponential distribution with parameter $\lambda$.

(c) Show that the random variable T in part (b) has the same
distribution as $X^{1/k}$ where X has a negative exponential distribution with
parameter $\lambda$, and use this to show that the n-th moment of T is

$$E(T^n) = \lambda^{-n/k}\, \Gamma(n/k + 1).$$
SECTION VII. - JOINT DISTRIBUTIONS, CORRELATION, AND CONDITIONING

References:

Paul L. Meyer, Introductory Probability and Statistical
Applications, 2nd Edition, Addison-Wesley, 1970, Chapter
9 and pp. 144-158.

Seymour Lipschutz, Theory and Problems of Probability,
Schaum's Outline Series, McGraw-Hill, New York, 1968,
Chapter 5.

Paul E. Pfeiffer, Concepts of Probability Theory, McGraw-Hill,
New York, 1965, pp. 142-179.

Let X and Y be two random variables defined on the same sample space
S. Just as a single random variable X carries probabilities from S into
the line R, thereby determining a probability measure on R called the
distribution of X, the pair of random variables (X,Y) carries probabilities
into the plane $R^2$, determining a probability measure on the Borel subsets¹ of
$R^2$ called the joint distribution of X and Y. In particular, the probability
carried into the half-open rectangle $(a,b] \times (c,d]$ by (X,Y) is

$$P(a < X \le b,\ c < Y \le d) = P\{s : a < X(s) \le b,\ c < Y(s) \le d\}.$$

Definition. Two random variables X and Y are said to have a discrete
joint distribution if there is a countable set $A = \{(x_j, y_k) : j = 1, 2, \ldots;\ k = 1, 2, \ldots\}$
such that $P\{(X,Y) \in A\} = 1$. In this case, the function p
defined on A by

$$p(x_j, y_k) = P(X = x_j,\ Y = y_k) \quad \text{for } j = 1, 2, \ldots;\ k = 1, 2, \ldots$$

is called the joint probability function of X and Y.

Clearly, $p(x,y) \ge 0$ for all (x,y) in A and $\sum_j \sum_k p(x_j, y_k) = 1$. Also,
if $p_X$ and $p_Y$ are the probability functions of X and Y, then

$$p_X(x_j) = \sum_k p(x_j, y_k) \ \text{ for } j = 1, 2, \ldots \quad \text{and} \quad p_Y(y_k) = \sum_j p(x_j, y_k) \ \text{ for } k = 1, 2, \ldots.$$

In this context, $p_X$ and $p_Y$ are called the marginal probability functions
of X and Y to distinguish them from the joint probability function p.
If the joint probability function is given by a two-way table as in the
example below, then the marginal probability functions can be obtained by
summing the rows and columns in the table.

¹The class of Borel subsets of $R^2$ is the smallest collection of sets that
contains the rectangles $(a,b] \times (c,d]$ and is closed under countable set
operations.

Example. A fair coin is tossed four times. Let X be the number of
heads on the first two tosses, and let Y be the number of heads on all four
tosses. Then the joint and marginal probability functions of X and Y are
as follows:

                          x
      y          0        1        2      $p_Y(y)$
   ------------------------------------------------
      0        1/16       0        0        1/16
      1        1/8       1/8       0        1/4
      2        1/16      1/4      1/16      3/8
      3         0        1/8      1/8       1/4
      4         0         0       1/16      1/16
   ------------------------------------------------
   $p_X(x)$    1/4       1/2      1/4        1
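
A table of this kind can be produced by brute-force enumeration of the 16 equally likely sequences of tosses. The sketch below is an added illustration using exact fractions; the dictionary names `joint`, `p_X`, and `p_Y` are chosen here for convenience.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint probability function of (X, Y): each of the 16 sequences has probability 1/16.
joint = defaultdict(Fraction)
for tosses in product((0, 1), repeat=4):      # 1 denotes heads, 0 denotes tails
    x = tosses[0] + tosses[1]                 # heads on the first two tosses
    y = sum(tosses)                           # heads on all four tosses
    joint[(x, y)] += Fraction(1, 16)

# Marginal probability functions: sum the joint table over the other variable.
p_X = {x: sum(joint[(x, y)] for y in range(5)) for x in range(3)}
p_Y = {y: sum(joint[(x, y)] for x in range(3)) for y in range(5)}

for y in range(5):
    print(y, [str(joint[(x, y)]) for x in range(3)], str(p_Y[y]))
print("p_X", [str(p_X[x]) for x in range(3)])
```

Summing over y gives the bottom margin of the table and summing over x gives the right-hand margin, which is the row-and-column summation described in the text.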

The joint distribution of any pair of random variables is completely
determined by their joint distribution function F, which is defined by

$$F(x,y) = P(X \le x,\ Y \le y) \quad \text{for all } x \text{ and } y.$$

To distinguish the joint distribution function from the individual
distribution functions of X and Y, the latter are referred to as the marginal
distribution functions in this context. The marginal distribution functions
can be determined from the joint distribution function by

$$F_X(x) = F(x, \infty) \quad \text{and} \quad F_Y(y) = F(\infty, y).$$

The "bivariate" distribution function F has properties analogous to
those in the "univariate" case (see page 34).
(a) $0 \le F(x,y) \le 1$ for all (x,y) in $R^2$.

(b) $F(x, -\infty) = F(-\infty, y) = 0$ for all x and y, and $F(\infty, \infty) = 1$.

(c) F is monotonically increasing and right continuous in each of its
arguments.

(d) $P(a < X \le b,\ c < Y \le d) = F(b,d) - F(b,c) - F(a,d) + F(a,c)$.

Although the joint distribution function is of theoretical interest since
it characterizes any type of joint distribution, it is hard to visualize and
awkward to work with. Therefore, in practice, the joint distribution of a
pair of random variables is ordinarily specified by giving either their joint
probability function or their joint density function, which is defined as
follows:

Definition. Two random variables X and Y are said to have a
continuous (or absolutely continuous) joint distribution if there is a nonnegative
function f on $R^2$ (called the joint density function of X and Y) such
that for all (x,y)

$$F(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(x', y')\, dy'\, dx'.$$

This is equivalent to saying that X and Y have a continuous
distribution if there is a nonnegative function f on $R^2$ such that for all real
numbers a, b, c, and d with a < b and c < d

$$P(a < X < b,\ c < Y < d) = \int_a^b \int_c^d f(x,y)\, dy\, dx.$$

Hence, in this case, the probability P(a < X < b, c < Y < d) has the
geometrical interpretation as the volume under the surface z = f(x,y) and above the
rectangle $(a,b) \times (c,d)$.

If X and Y have joint density function f, then the "marginal" density
function of X is $f_X(x) = \int_{-\infty}^{\infty} f(x,y)\, dy$, because

$$F_X(x) = F(x, \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(x', y)\, dy\, dx',$$

and it follows from the definition of the density function (see page 35) that
X has the density function $\int_{-\infty}^{\infty} f(x,y)\, dy$.
Example. Let X and Y be the successive waiting times for two calls
coming into a telephone exchange. Suppose that X and Y have joint density
function

$$f(x,y) = e^{-x-y} \quad \text{for } x > 0,\ y > 0.$$

[Assume here and below that f(x,y) = 0 for values of x and y other than
those for which the functional form is specified.] In this case, the marginal
density function of X is

$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\, dy = \int_0^{\infty} e^{-x-y}\, dy = e^{-x} \quad \text{for } x > 0.$$

By the symmetry of the joint density function, Y has the same density function
as X. The following two examples illustrate how the joint density function
can be used in computing probabilities of events:

(a) $P(\min(X,Y) > 2) = P(X > 2,\ Y > 2) = \int_2^{\infty} \int_2^{\infty} e^{-x-y}\, dy\, dx$
    $= \int_2^{\infty} e^{-x} \left[\int_2^{\infty} e^{-y}\, dy\right] dx = e^{-2} \int_2^{\infty} e^{-x}\, dx = e^{-4}$.

(b) $P(X + Y < 2) = \int_0^2 \int_0^{2-x} e^{-x-y}\, dy\, dx$
    $= \int_0^2 e^{-x}\,[1 - e^{x-2}]\, dx = \int_0^2 (e^{-x} - e^{-2})\, dx = 1 - 3e^{-2}$.
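
Both answers can be checked by approximating the volume under the joint density with a midpoint Riemann sum. This is only a crude sketch added for illustration: the grid size, the truncation of the plane at 20, and the helper name `prob` are arbitrary assumptions, and the estimate for (b) is accurate only to about two decimal places because the region has a diagonal boundary that cuts through grid cells.

```python
from math import exp

def f(x, y):
    """Joint density of the two waiting times."""
    return exp(-x - y) if x > 0 and y > 0 else 0.0

def prob(event, upper=20.0, steps=400):
    """Midpoint Riemann sum of f over the part of (0, upper) x (0, upper)
    where event(x, y) is true; mass beyond `upper` is negligible here."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        for j in range(steps):
            y = (j + 0.5) * h
            if event(x, y):
                total += f(x, y) * h * h
    return total

p_min = prob(lambda x, y: min(x, y) > 2)
p_sum = prob(lambda x, y: x + y < 2)
print(p_min, exp(-4))
print(p_sum, 1 - 3 * exp(-2))
```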

Given the joint density of X and Y, one can (in theory) derive the
distribution of random variables Z that are functions of X and Y. For example,
let Z = X + Y. Then the distribution function of Z for z > 0 is

$$F(z) = P(Z \le z) = P(X + Y \le z) = 1 - e^{-z} - z e^{-z}.$$

The last expression follows by a calculation like that in (b) above. It follows
that Z has the density function

$$f_Z(z) = F'(z) = e^{-z} + z e^{-z} - e^{-z} = z e^{-z} \quad \text{for } z > 0.$$

Exercises.

1. Each of three balls is placed at random into one of three cells. Let X
be the number of balls in cell #1 and Y the number in cell #2.

(a) Verify that the joint and marginal probability functions of X and
Y are as follows:

                               x
      y          0        1        2        3      $p_Y(y)$
   ---------------------------------------------------------
      0        1/27      1/9      1/9      1/27      8/27
      1        1/9       2/9      1/9       0        4/9
      2        1/9       1/9       0        0        2/9
      3        1/27       0        0        0        1/27
   ---------------------------------------------------------
   $p_X(x)$    8/27      4/9      2/9      1/27       1

(b) Derive the probability function of Z = X + Y and verify that
E(Z) = E(X) + E(Y). Ans. p(0) = 1/27, p(1) = 2/9, p(2) = 4/9, p(3) = 8/27.

(c) Show that Var(X) = Var(Y) = Var(Z) = 2/3, so that
$\mathrm{Var}(X + Y) \ne \mathrm{Var}(X) + \mathrm{Var}(Y)$ in this case.

2. Suppose X and Y have the joint density function

\[ f(x,y) = x + y \quad \text{for } 0 < x < 1,\ 0 < y < 1. \]

(a) Show that $P(X < 1/2,\ Y < 1/2) = 1/8$.

(b) Show that X and Y have the same marginal density function $g(x) = x + 1/2$ for $0 < x < 1$, and $P(X < 1/2) = P(Y < 1/2) = 3/8$. [Note that $P(X < 1/2,\ Y < 1/2) \ne P(X < 1/2)P(Y < 1/2)$.]

(c) It can be shown that, if $Z = X + Y$, then Z has the density function

\[ h(z) = \begin{cases} z^2 & \text{for } 0 < z < 1 \\ z(2 - z) & \text{for } 1 < z < 2. \end{cases} \]

Show that $E(Z) = 7/6$, $\mathrm{Var}(Z) = 5/36$, $E(X) = E(Y) = 7/12$, and $\mathrm{Var}(X) = \mathrm{Var}(Y) = 11/144$. Thus, $E(X+Y) = E(X) + E(Y)$ but $\mathrm{Var}(X+Y) \ne \mathrm{Var}(X) + \mathrm{Var}(Y)$.
3. Suppose X and Y have joint density function

\[ f(x,y) = 1/2 \quad \text{for } 0 < x < y < 2, \]

so that the density function is constant over the shaded region in the figure at the right.

(a) Show that $P(X < 1) = 3/4$.

(b) Show that the marginal density functions of X and Y are $f_X(x) = (2-x)/2$ for $0 < x < 2$ and $f_Y(y) = y/2$ for $0 < y < 2$.

(c) Verify that $Z = Y - X$ has the same density function as X by first noting that $1 - F_Z(c) = P(Y - X > c) = (2 - c)^2/4$ for $0 < c < 2$.

(d) Show that $E(X) = E(Z) = 2/3$ and $E(Y) = 4/3$, verifying that $E(Y - X) = E(Y) - E(X)$.

(e) Show that $\mathrm{Var}(X) = \mathrm{Var}(Z) = \mathrm{Var}(Y) = 2/9$.

The definitions above for the "bivariate" case extend immediately to the "multivariate" case. Let $X_1, X_2, \ldots, X_n$ be n random variables defined on the same sample space. If the random variables $X_i$ are all discrete, then the joint probability function of $X_1, \ldots, X_n$ is defined by

\[ p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1,\ X_2 = x_2,\ \ldots,\ X_n = x_n). \]

Whether the random variables are discrete or not, the joint distribution function of $X_1, \ldots, X_n$ is defined by

\[ F(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1,\ \ldots,\ X_n \le x_n). \]

The random variables are said to have a continuous joint distribution if there is a function f on $R^n$ (called the joint density function) such that

\[ P\big((X_1, X_2, \ldots, X_n) \in B\big) = \int_{a_n}^{b_n} \cdots \int_{a_1}^{b_1} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n \]

for all n-dimensional rectangles $B = (a_1,b_1) \times (a_2,b_2) \times \cdots \times (a_n,b_n)$.
The following theorem is the multivariate analog of Theorems 5-2 and 5-4 in the univariate case. The proof in the discrete case is like that given for Theorem 5-2.

Theorem 7-1. Let $Y = g(X_1, X_2, \ldots, X_n)$ be a random variable such that E(Y) exists. Then

(a) If $X_1, X_2, \ldots, X_n$ are discrete random variables having joint probability function p,

\[ E(Y) = \sum g(x_1, x_2, \ldots, x_n)\,p(x_1, x_2, \ldots, x_n), \]

where the summation is over all points $(x_1, x_2, \ldots, x_n)$ for which $p(x_1, x_2, \ldots, x_n) > 0$.

(b) If $X_1, X_2, \ldots, X_n$ have joint density function f,

\[ E(Y) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x_1, x_2, \ldots, x_n)\,f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n. \]

Example. A fair die is tossed three times. Let $X_i$ be the result on the i-th toss. Then the joint probability function of $X_1, X_2, X_3$ is $p(x_1, x_2, x_3) = (1/6)^3$ for all $(x_1, x_2, x_3)$, $x_i = 1, 2, \ldots, 6$. By the theorem above, if $Y = X_1 X_2 X_3$ then

\[ E(Y) = \sum x_1 x_2 x_3 / 6^3, \]

the summation being over all triples $(x_1, x_2, x_3)$ with $x_i = 1, 2, \ldots, 6$. But since $\sum x_1 x_2 x_3$ is the expansion of $(1+2+3+4+5+6)^3$, $E(Y) = (21)^3/6^3 = (7/2)^3$.

Note that, in this case, $E(X_1 X_2 X_3) = E(X_1) \cdot E(X_2) \cdot E(X_3)$. It is not true in general that, for any two random variables X and Y, $E(XY) = E(X) \cdot E(Y)$.
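The die example can be verified by brute-force enumeration. The sketch below is an illustration added here, not the author's calculation; it uses exact rational arithmetic, and the choice Y = X_1 for the contrasting dependent case is an assumption made only for the illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
point_prob = Fraction(1, 6) ** 3  # p(x1, x2, x3) = (1/6)^3 at every triple

# E(X1 X2 X3) as a sum over all 216 equally likely triples (Theorem 7-1(a)).
expected_product = sum(a * b * c * point_prob for a, b, c in product(faces, repeat=3))

mean_single = Fraction(sum(faces), 6)  # E(X_i) = 21/6 = 7/2

# A dependent pair for contrast: with Y = X1, E(X1 * Y) = E(X1^2).
second_moment = sum(Fraction(a * a, 6) for a in faces)

print(expected_product, mean_single ** 3)  # equal: the tosses are independent
print(second_moment, mean_single ** 2)     # unequal: X1 is not independent of itself
```

The first line shows the product rule holding for the three tosses; the second shows it failing for a pair of dependent random variables.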

Definition. The random variables $X_1, X_2, \ldots, X_n$ are said to be independent if for all Borel subsets $A_1, A_2, \ldots, A_n$ of R,

\[ P(X_1 \in A_1,\ X_2 \in A_2,\ \ldots,\ X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i). \]

It can be shown that this relationship holds for all Borel subsets $A_i$ if and only if it holds for all sets $A_i$ of the form $A_i = (-\infty, x]$ for some x. Hence, $X_1, X_2, \ldots, X_n$ are independent if and only if their joint distribution function F is the product of the marginal distribution functions:

\[ F(x_1, \ldots, x_n) = F_1(x_1) F_2(x_2) \cdots F_n(x_n) \quad \text{for all } x_1, x_2, \ldots, x_n, \]

where $F_i$ is the distribution function of $X_i$. In the discrete case, it follows immediately from the definition of independence that the joint probability function must be the product of the marginal probability functions at every point $(x_1, x_2, \ldots, x_n)$:

\[ p(x_1, x_2, \ldots, x_n) = p_1(x_1) p_2(x_2) \cdots p_n(x_n). \]

In the continuous case, if $X_1, X_2, \ldots, X_n$ are independent and $X_i$ has the marginal density function $f_i$, then the joint density can be taken as the product of the marginal density functions:

\[ f(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n). \]

In a probability model for n independent experiments (see page 25), if $X_k$ depends on the outcome of the k-th trial only, then the random variables $X_1, X_2, \ldots, X_n$ are independent, in which case the joint distribution function (or probability function or density function) is the product of the marginal distribution functions (or probability functions or density functions).
Examples.

1. The random variables $X_1$, $X_2$, and $X_3$ in the previous example are independent. Here, the marginal probability function of $X_i$ is $p_i(x) = 1/6$ for $x = 1, 2, \ldots, 6$.

2. The random variables X and Y having joint density $f(x,y) = e^{-x-y}$ for $x > 0$, $y > 0$ are independent. As was shown on page 71, the marginal density functions of X and Y are $f_X(x) = e^{-x}$ for $x > 0$ and $f_Y(y) = e^{-y}$ for $y > 0$. Hence, the joint density is the product of the marginal densities in this case.

3. Consider a sequence of n Bernoulli trials with probability p of success on each trial. Let $X_i$ be 1 or 0 according as the i-th trial is a success or not. Then, since the probability function of $X_i$ is

\[ p_i(x_i) = p^{x_i}(1-p)^{1-x_i} \quad \text{for } x_i = 0 \text{ or } 1, \]

the joint probability function of $X_1, X_2, \ldots, X_n$ is

\[ p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i}. \]

4. Let $X_1, X_2, \ldots, X_n$ be independent random variables, each having a $N(\mu, \sigma^2)$ distribution. Then the joint density function of $X_1, X_2, \ldots, X_n$ is

\[ f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x_i - \mu)^2/2\sigma^2} = (2\pi)^{-n/2}\,\sigma^{-n}\, e^{-\sum (x_i - \mu)^2/2\sigma^2}. \]

Theorem 7-2. If X and Y are independent random variables, then

(a) so are $U = g(X)$ and $V = h(Y)$,

(b) $E(XY) = E(X) \cdot E(Y)$,

(c) $E[g(X)h(Y)] = E[g(X)] \cdot E[h(Y)]$,

and (d) $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$,

provided that the indicated expectations exist.

Proof: (a) $P(U \in A,\ V \in B) = P(g(X) \in A,\ h(Y) \in B) = P(X \in g^{-1}(A),\ Y \in h^{-1}(B))$
$= P(X \in g^{-1}(A))\,P(Y \in h^{-1}(B)) = P(g(X) \in A)\,P(h(Y) \in B) = P(U \in A)\,P(V \in B)$. [Note: $g^{-1}(A)$ is defined as $\{x : g(x) \in A\}$.]

(b) In the discrete case, it follows from Theorem 7-1 that

\[ E(XY) = \sum_x \sum_y xy\,p(x,y) = \sum_x \sum_y xy\,p_X(x)\,p_Y(y) = \sum_x x\,p_X(x) \cdot \sum_y y\,p_Y(y) = E(X) \cdot E(Y). \]

The proof for the continuous case is similar.

(c) This follows immediately from (a) and (b).

(d) \[ \mathrm{Var}(X+Y) = E\big(X + Y - E(X+Y)\big)^2 = E\big[(X - \mu_X) + (Y - \mu_Y)\big]^2 \]
\[ = E\big[(X - \mu_X)^2 + (Y - \mu_Y)^2 + 2(X - \mu_X)(Y - \mu_Y)\big] = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2E[(X - \mu_X)(Y - \mu_Y)]. \]

By (c), the expectation in the last term is equal to $E(X - \mu_X) \cdot E(Y - \mu_Y) = 0$.

Note from the proof of (d) above that, in general,

\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2E[(X - \mu_X)(Y - \mu_Y)]. \]

The expectation in the last term on the right, which has value 0 when X and Y are independent, provides a convenient measure of association between two random variables.

Definition. The covariance of two random variables X and Y is defined by

\[ \mathrm{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)], \]

provided the indicated expectations exist. If X and Y have nonzero variances $\sigma_X^2$ and $\sigma_Y^2$, the correlation coefficient of X and Y, denoted by $\rho(X,Y)$ or just by $\rho$ if no ambiguity results, is defined by

\[ \rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = E\left[\frac{X - E(X)}{\sigma_X} \cdot \frac{Y - E(Y)}{\sigma_Y}\right]. \]

X and Y are said to be uncorrelated if $\mathrm{Cov}(X,Y) = 0$.

Theorem 7-3. Assuming that all the expectations indicated below exist, the following properties hold:

(a) $\mathrm{Cov}(X,Y) = E(XY) - E(X) \cdot E(Y)$

(b) $\mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X)$ and $\mathrm{Cov}(X,X) = \mathrm{Var}(X)$

(c) $\mathrm{Cov}(aX + b,\ cY + d) = ac \cdot \mathrm{Cov}(X,Y)$

(d) $\mathrm{Cov}(\sum a_i X_i,\ Y) = \sum a_i\,\mathrm{Cov}(X_i, Y)$

(e) $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y) = \sigma_X^2 + \sigma_Y^2 + 2\rho\,\sigma_X \sigma_Y$

(f) $\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{i<j} \mathrm{Cov}(X_i, X_j)$

(g) If X and Y are independent, $\mathrm{Cov}(X,Y) = \rho(X,Y) = 0$

(h) If $X_1, X_2, \ldots, X_n$ are independent, then $\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i)$

(i) $\rho(aX + b,\ cY + d) = \rho(X,Y)$ if $ac > 0$; $= -\rho(X,Y)$ if $ac < 0$; and is undefined if $ac = 0$

(j) If $Y = aX + b$ where $a \ne 0$, then $\rho(X,Y) = 1$ if $a > 0$, and $\rho(X,Y) = -1$ if $a < 0$.
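Several parts of Theorem 7-3 can be checked on a small discrete distribution. The sketch below is an illustration, not part of the original notes: it uses the counts X and Y of balls in cells #1 and #2 when three balls are placed at random into three cells, and the helper names `expect` and `cov` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def expect(joint, g):
    """E[g(X, Y)] for a discrete joint probability function stored as a dict."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def cov(joint, g, h):
    """Cov(g, h) computed as E(gh) - E(g)E(h), as in Theorem 7-3(a)."""
    return expect(joint, lambda x, y: g(x, y) * h(x, y)) - expect(joint, g) * expect(joint, h)

# Joint probability function of (X, Y): enumerate the 27 equally likely placements.
joint = {}
for cells in product((1, 2, 3), repeat=3):
    key = (cells.count(1), cells.count(2))
    joint[key] = joint.get(key, 0) + Fraction(1, 27)

def first(x, y):
    return x

def second(x, y):
    return y

def total(x, y):
    return x + y

var_x = cov(joint, first, first)    # Cov(X, X) = Var(X), part (b)
var_y = cov(joint, second, second)
var_sum = cov(joint, total, total)  # Var(X + Y)
cov_xy = cov(joint, first, second)

print(var_x, var_y, cov_xy, var_sum)
print(var_sum == var_x + var_y + 2 * cov_xy)  # part (e)
```

Here the covariance is negative (a ball in cell #1 cannot also be in cell #2), which is why Var(X + Y) falls short of Var(X) + Var(Y).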

The correlation coefficient between two random variables X and Y is a measure of the amount of linear relationship between them. If Y is well approximated (or well "predicted") by a linear function of X, say $a + bX$, then $|\rho|$ is close to 1. Otherwise, $\rho$ is close to zero. This is made precise by the following theorem.

Theorem 7-4. Let X and Y be random variables having nonzero variances.

(a) If $\alpha$ and $\beta$ are the values of a and b which minimize $S(a,b) = E[Y - (a + bX)]^2$, then $\beta = \rho\,\sigma_Y/\sigma_X$, $\alpha = E(Y) - \beta E(X)$, and

\[ S(\alpha, \beta) = (1 - \rho^2)\,\sigma_Y^2. \]

(b) $-1 \le \rho(X,Y) \le 1$, and $|\rho| = 1$ if and only if there exist constants $\alpha$ and $\beta$ such that $P(Y = \alpha + \beta X) = 1$.

Proof: (a)
\[ E[Y - (a + bX)]^2 = E\big[(Y - \mu_Y) - b(X - \mu_X) - (a - \mu_Y + b\mu_X)\big]^2 \]
\[ = \sigma_Y^2 + b^2 \sigma_X^2 + (a - \mu_Y + b\mu_X)^2 - 2b\,\mathrm{Cov}(X,Y) \]
\[ = (b\sigma_X - \rho\sigma_Y)^2 + (1 - \rho^2)\,\sigma_Y^2 + (a - \mu_Y + b\mu_X)^2. \]

Only the first and third terms depend on a and b, and these can be minimized by setting $\beta = \rho\,\sigma_Y/\sigma_X$ and $\alpha = \mu_Y - \beta\mu_X$. For these values of a and b,

\[ E[Y - (a + bX)]^2 = (1 - \rho^2)\,\sigma_Y^2. \]

(b) Since both $E[Y - (\alpha + \beta X)]^2$ and $\sigma_Y^2$ are nonnegative and $\sigma_Y^2 \ne 0$, it follows that $1 - \rho^2 \ge 0$, implying that $\rho^2 \le 1$ or $|\rho| \le 1$. If $\rho^2 = 1$, then $E[Y - (\alpha + \beta X)]^2 = 0$, which implies that $Y = \alpha + \beta X$, except perhaps on a set of probability zero.

Since $\min_a E(Y - a)^2 = E(Y - \mu_Y)^2 = \sigma_Y^2$ (see Exercise 7, page 52) and $\min_{a,b} E[Y - (a + bX)]^2 = (1 - \rho^2)\,\sigma_Y^2$, incorporating the random variable X into the "linear predictor" $a + bX$ reduces the lowest attainable mean squared prediction error from $\sigma_Y^2$ to $(1 - \rho^2)\,\sigma_Y^2$. Thus $\rho^2$ is the proportional reduction in mean squared error that results from including X in the predictor.

The random variable $\alpha + \beta X$ referred to in Theorem 7-4 is sometimes called the best linear predictor of Y based on X. The line $y = \alpha + \beta x$ is called the regression line of Y upon X. This line can be written in the form:

\[ \frac{y - E(Y)}{\sigma_Y} = \rho\,\frac{x - E(X)}{\sigma_X}. \]
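The formulas of Theorem 7-4 can be tried out on a small joint probability function. The sketch below is an illustration under stated assumptions, not part of the original notes: X is taken to be the indicator of heads on the first of three fair coin tosses and Y the total number of heads, and the function names are hypothetical. Since $\beta = \rho\,\sigma_Y/\sigma_X = \mathrm{Cov}(X,Y)/\sigma_X^2$, no square roots are needed and exact fractions can be used throughout.

```python
from fractions import Fraction

def moments(joint):
    """Return E(X), E(Y), Var(X), Var(Y), Cov(X, Y) for a discrete joint table."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    var_x = sum(p * (x - ex) ** 2 for (x, y), p in joint.items())
    var_y = sum(p * (y - ey) ** 2 for (x, y), p in joint.items())
    cov_xy = sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())
    return ex, ey, var_x, var_y, cov_xy

def best_linear_predictor(joint):
    """Return (alpha, beta, rho_squared, minimum mean squared error)."""
    ex, ey, var_x, var_y, cov_xy = moments(joint)
    beta = cov_xy / var_x            # equals rho * sigma_Y / sigma_X
    alpha = ey - beta * ex
    rho_sq = cov_xy ** 2 / (var_x * var_y)
    return alpha, beta, rho_sq, (1 - rho_sq) * var_y

def mean_squared_error(joint, a, b):
    """S(a, b) = E[Y - (a + bX)]^2."""
    return sum(p * (y - (a + b * x)) ** 2 for (x, y), p in joint.items())

F = Fraction
coin = {(0, 0): F(1, 8), (0, 1): F(1, 4), (0, 2): F(1, 8),
        (1, 1): F(1, 8), (1, 2): F(1, 4), (1, 3): F(1, 8)}

alpha, beta, rho_sq, min_mse = best_linear_predictor(coin)
print(alpha, beta, rho_sq, min_mse)
print(mean_squared_error(coin, alpha, beta) == min_mse)
```

Evaluating `mean_squared_error` at nearby values of a and b gives a quick sanity check that no other line does better.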

Examples.

1. A fair coin is tossed 3 times in succession. Let X be 1 or 0 according as the first toss results in heads or not, and let Y be the number of heads on all 3 tosses. Then X and Y have the following joint probability function:

                   x
      y        0       1     p_Y(y)
      0       1/8      0      1/8
      1       1/4     1/8     3/8
      2       1/8     1/4     3/8
      3        0      1/8     1/8
    p_X(x)    1/2     1/2      1

Here, $E(X) = 1/2$, $E(Y) = 3/2$, $\mathrm{Var}(X) = 1/4$, and $\mathrm{Var}(Y) = 3/4$. Since $E(XY) = 1(1/8) + 2(1/4) + 3(1/8) = 1$, $\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y) = 1 - (1/2)(3/2) = 1/4$, and $\rho(X,Y) = \mathrm{Cov}(X,Y)/\sigma_X \sigma_Y = 1/\sqrt{3}$. The regression line of Y upon X is

\[ \frac{y - 3/2}{\sqrt{3}/2} = \frac{1}{\sqrt{3}} \left( \frac{x - 1/2}{1/2} \right), \]

which can be written in the form $y = x + 1$. Note the significance of the value of $\rho^2 = 1/3$ in this case.

2. Suppose $Y = X + U$ where X and U are independent. (For example, in Example 1, U is the number of heads on the last two tosses.) In this case, $\mathrm{Cov}(X,Y) = \mathrm{Cov}(X, X + U) = \mathrm{Cov}(X,X) + \mathrm{Cov}(X,U) = \sigma_X^2$ and $\rho(X,Y) = \sigma_X^2/\sigma_X \sigma_Y = \sigma_X/\sigma_Y$. The best linear predictor of Y based on X turns out to be $X + E(U)$.

As a special case, suppose a coin which has probability p of turning up heads is tossed n times. If X is the number of heads on the first m (< n) tosses and Y is the number of heads on all n tosses, then $\sigma_X^2 = mpq$ and $\sigma_Y^2 = npq$, so that $\rho = \sigma_X/\sigma_Y = \sqrt{m/n}$. Again note the significance of the value of $\rho^2$.

3. Suppose X and Y have joint density $f(x,y) = 2$ for $0 < y < x < 1$, so that the density function is constant over the triangular region in the figure at the right. Then

\[ E(XY) = \int_0^1\!\int_0^x 2xy\,dy\,dx = \int_0^1 x^3\,dx = 1/4. \]

The marginal density of X in this case is

\[ f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \int_0^x 2\,dy = 2x \quad \text{for } 0 < x < 1, \]

so that $E(X) = 2/3$ and $\mathrm{Var}(X) = 1/18$. The marginal density of Y is

\[ f_Y(y) = \int_y^1 2\,dx = 2(1 - y) \quad \text{for } 0 < y < 1, \]

so that $E(Y) = 1/3$ and $\mathrm{Var}(Y) = 1/18$. It follows that $\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y) = 1/36$ and $\rho(X,Y) = \mathrm{Cov}(X,Y)/\sigma_X \sigma_Y = 1/2$. Thus, the regression line of Y on X is $y = x/2$.

4. Let (X,Y) have joint probability function $p(x_i, y_i) = 1/n$ where $(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$ are any n points on the plane. Since this situation usually arises in a context where the pairs $(x_i, y_i)$ are regarded as being a sample from a larger population, the means and variances of X and Y are called sample means and sample variances in this case, and special notation is introduced: $\bar{x}$ for E(X), $s_x^2$ for $\sigma_X^2$, and r for $\rho$.

Here, $\bar{x} = \sum x_i/n$, $s_x^2 = \sum (x_i - \bar{x})^2/n = (\sum x_i^2/n) - \bar{x}^2$, and similar formulas hold for $\bar{y}$ and $s_y^2$. Omitting the subscripts i below, we can write $r = \mathrm{Cov}(X,Y)/s_x s_y$ where

\[ \mathrm{Cov}(X,Y) = \sum (x - \bar{x})(y - \bar{y})/n = (\sum xy/n) - \bar{x}\,\bar{y} = \big[\sum xy - (1/n)\sum x \sum y\big]/n, \]

providing a convenient formula for hand calculations.

Since choosing $\alpha$ and $\beta$ to minimize $E(Y - a - bX)^2 = \sum (y_i - a - bx_i)^2/n$ amounts to choosing $\alpha$ and $\beta$ to minimize $\sum (y_i - a - bx_i)^2$, the resulting regression line is called the least squares regression line in this case. Applying the formulas which hold for any regression line, we see that the coefficients of the regression are given by $\beta = r s_y/s_x$ and $\alpha = \bar{y} - \beta\bar{x}$. Since $r = \mathrm{Cov}(X,Y)/s_x s_y$, $\beta$ can be computed using the formula

\[ \beta = \frac{\mathrm{Cov}(X,Y)}{s_x^2} = \frac{\sum (x - \bar{x})(y - \bar{y})/n}{\sum (x - \bar{x})^2/n} = \frac{\sum xy - (1/n)\sum x \sum y}{\sum x^2 - (1/n)(\sum x)^2}. \]

[Figure: sample points $(x_i, y_i)$ and the least squares regression line $y = \alpha + \beta x$.]

For example, the pairs of scores on the left below result from comparing 14 students' diagnostic test scores (x) on a simple algebra test with their final exam scores (y) in a certain statistics course. A plot of the points and the regression line is given on the next page. Note that the regression line passes through the point $(\bar{x}, \bar{y})$.

       x      y
      36     49         $\sum x^2 = 17080$, $\sum y^2 = 51517$, $\sum xy = 29466$
      29     47
      41     75         $\bar{x} = \sum x/n = 484/14 = 34.57$
      34     60
      33     55         $\bar{y} = \sum y/n = 839/14 = 59.93$
      31     47
      43     79         $\sum (x - \bar{x})^2 = \sum x^2 - (1/n)(\sum x)^2 = 17080 - (484)^2/14 = 347.43$
      33     60
      35     56         $\sum (y - \bar{y})^2 = \sum y^2 - (1/n)(\sum y)^2 = 51517 - (839)^2/14 = 1236.93$
      23     54
      41     65         $\sum (x - \bar{x})(y - \bar{y}) = \sum xy - (1/n)\sum x \sum y = 29466 - (484)(839)/14 = 460.57$
      35     68
      37     59         $s_x^2 = 347.43/14 = 24.82$
      33     65
     ---    ---         $s_y^2 = 1236.93/14 = 88.35$
     484    839

$\beta = 460.57/347.43 = 1.326$

$\alpha = 59.93 - 1.326(34.57) = 14.10$

$r = 460.57/\sqrt{(347.43)(1236.93)} = 0.703$
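The hand calculation above can be reproduced in a few lines. The sketch below is illustrative only; the function name `least_squares` is hypothetical, and the pairing of x and y scores is an assumption reconstructed from the printed table, checked against the stated totals $\sum x = 484$, $\sum y = 839$, and $\sum xy = 29466$.

```python
import math

# Score pairs (diagnostic x, final exam y). The pairing is reconstructed and
# agrees with the stated totals sum(x) = 484, sum(y) = 839, sum(xy) = 29466.
xs = [36, 29, 41, 34, 33, 31, 43, 33, 35, 23, 41, 35, 37, 33]
ys = [49, 47, 75, 60, 55, 47, 79, 60, 56, 54, 65, 68, 59, 65]

def least_squares(xs, ys):
    """Return (alpha, beta, r) for the least squares line y = alpha + beta * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(v * v for v in xs) - sum_x ** 2 / n                 # sum (x - xbar)^2
    syy = sum(v * v for v in ys) - sum_y ** 2 / n                 # sum (y - ybar)^2
    sxy = sum(a * b for a, b in zip(xs, ys)) - sum_x * sum_y / n  # sum (x - xbar)(y - ybar)
    beta = sxy / sxx
    alpha = sum_y / n - beta * sum_x / n
    r = sxy / math.sqrt(sxx * syy)
    return alpha, beta, r

alpha, beta, r = least_squares(xs, ys)
print(round(alpha, 2), round(beta, 3), round(r, 3))
```

The function works from the "computational" forms of the sums of squares, exactly as in the hand calculation; for data with very large means a centered two-pass computation would be numerically safer.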


Exercises.

1. Let X and Y have the joint probability function given in the example on page 69. (a) Show that $\mathrm{Cov}(X,Y) = 1/2$, $\mathrm{Var}(X) = 1/2$, and $\mathrm{Var}(Y) = 1$, so that $\rho(X,Y) = \sqrt{1/2}$. (b) Show that the regression line of Y on X is $y = x + 1$. (c) Show that the regression line of X on Y is $x = y/2$.

2. Let X and Y have the joint probability function in Exercise 1, page 71. (a) Show that $\mathrm{Cov}(X,Y) = -1/3$ and $\rho(X,Y) = -1/2$. (b) In Exercise 1(c), page 72, you showed that $\mathrm{Var}(X + Y) = \mathrm{Var}(X) = \mathrm{Var}(Y) = 2/3$. Recompute $\mathrm{Var}(X + Y)$ using Theorem 7-3(e). (c) Show that the regression line of Y on X is $y = (3 - x)/2$.

3. Let X and Y have the joint density function

\[ f(x,y) = x + y \quad \text{for } 0 < x < 1,\ 0 < y < 1. \]

(See Exercise 2, page 72.) Show that $\mathrm{Cov}(X,Y) = -1/144$, $\rho(X,Y) = -1/11$, and the regression line of Y on X is $y = (7 - x)/11$.

4. By Theorem 7-3(g), if X and Y are independent, they are uncorrelated. The converse of this theorem does not hold in general. (a) Show that, if X and Y have joint probability function $p(0,0) = p(-1,1) = p(1,1) = 1/3$, then X and Y are uncorrelated but not independent. (b) Suppose X has a N(0,1) distribution and $Y = X^2$. Show that $\rho(X,Y) = 0$. [Hint: $E(XY) = E(X^3) = 0$ in this case.]

5. Let X and Y have joint density $f(x,y) = 1/2$ for $0 < x < y < 2$ as in Exercise 3, page 73. Show that $\mathrm{Cov}(X,Y) = 1/9$, $\rho(X,Y) = 1/2$, and verify that the regression line of Y on X is $y = (x + 2)/2$.

6. Show that the constant c which minimizes $E(Y - cX)^2$ is $c = E(XY)/E(X^2)$. Use the result to deduce from $E(Y - cX)^2 \ge 0$ that

\[ |E(XY)| \le \sqrt{E(X^2)\,E(Y^2)}. \quad \text{(Cauchy-Schwarz Inequality.)} \]
Definition. The random variables X and Y are said to have a bivariate normal distribution with parameters $\xi$, $\eta$, $\sigma_x$, $\sigma_y$, and $\rho$, where $\sigma_x > 0$, $\sigma_y > 0$, and $|\rho| < 1$, if the joint density of X and Y is given by

\[ f(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x - \xi}{\sigma_x}\right)^2 - 2\rho \left(\frac{x - \xi}{\sigma_x}\right)\left(\frac{y - \eta}{\sigma_y}\right) + \left(\frac{y - \eta}{\sigma_y}\right)^2 \right] \right\}. \]

This density function has a maximum at $(x,y) = (\xi, \eta)$. The contours of the density function are concentric ellipses centered at $(\xi, \eta)$. Any plane perpendicular to the (x,y) plane cuts the surface f(x,y) in a curve of the normal form.

Theorem 7-5. If X and Y have the bivariate normal density above, then

(a) $X \sim N(\xi, \sigma_x^2)$, $Y \sim N(\eta, \sigma_y^2)$, and the correlation coefficient of X and Y is $\rho$;

(b) X and Y are independent if and only if $\rho = 0$;

(c) If $Z = a + bX + cY$ where either $b \ne 0$ or $c \ne 0$, then Z has a normal distribution with mean $a + b\xi + c\eta$ and variance $b^2 \sigma_x^2 + c^2 \sigma_y^2 + 2bc\rho\,\sigma_x \sigma_y$.

Proof: (a) The proof that X and Y have the specified marginal distributions follows from the fact that f(x,y) can be written in the form

\[ f(x,y) = \frac{1}{\sigma_x}\,\varphi\!\left(\frac{x - \xi}{\sigma_x}\right) \cdot \frac{1}{\sigma_y \sqrt{1 - \rho^2}}\,\varphi\!\left(\frac{y - \alpha - \beta x}{\sigma_y \sqrt{1 - \rho^2}}\right), \]

where $\beta = \rho\,\sigma_y/\sigma_x$, $\alpha = \eta - \beta\xi$, and $\varphi$ is the density function of the standard normal distribution. Integrating out y after a change of variables to $v = (y - \alpha - \beta x)/\sigma_y\sqrt{1 - \rho^2}$ yields $f_X(x) = \frac{1}{\sigma_x}\,\varphi\!\left(\frac{x - \xi}{\sigma_x}\right)$, which is the density of a $N(\xi, \sigma_x^2)$ distribution. A similar proof can be used to show that $Y \sim N(\eta, \sigma_y^2)$. The proof that X and Y have correlation coefficient $\rho$ will be given later in this section.

(b) The joint density f(x,y) factors into the marginal densities if and only if $\rho = 0$.

(c) See Alexander M. Mood and Franklin A. Graybill, Introduction to the Theory of Statistics, Second Edition, McGraw-Hill, New York, 1963, p. 211.
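The factorization used in the proof of part (a) can be spot-checked numerically. The sketch below is an illustration, not part of the original notes: it evaluates the bivariate normal density directly and through the product of the marginal density of X and a normal density in y centered at $\alpha + \beta x$; the function names and the parameter values in the check are arbitrary assumptions.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def bvn_density(x, y, xi, eta, sx, sy, rho):
    """Bivariate normal density with means xi, eta, std devs sx, sy, correlation rho."""
    u = (x - xi) / sx
    w = (y - eta) / sy
    q = (u * u - 2.0 * rho * u * w + w * w) / (1.0 - rho * rho)
    return math.exp(-0.5 * q) / (2.0 * math.pi * sx * sy * math.sqrt(1.0 - rho * rho))

def factored_density(x, y, xi, eta, sx, sy, rho):
    """Same density written as (marginal of X) times (normal density in y given x)."""
    beta = rho * sy / sx
    alpha = eta - beta * xi
    s = sy * math.sqrt(1.0 - rho * rho)
    return (phi((x - xi) / sx) / sx) * (phi((y - alpha - beta * x) / s) / s)

params = dict(xi=1.0, eta=-2.0, sx=1.5, sy=0.7, rho=-0.6)  # arbitrary test values
for x, y in [(0.0, 0.0), (1.0, -2.0), (2.5, -3.1), (-1.2, 0.4)]:
    print(bvn_density(x, y, **params), factored_density(x, y, **params))
```

The two columns printed should agree up to floating-point rounding at every point.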
Definition. If X and Y have joint probability function p(x,y), then the conditional probability function of Y given X = x is defined by

\[ p(y|x) = \frac{p(x,y)}{p_X(x)} \quad \text{provided } p_X(x) > 0. \]

If X and Y have joint density function f(x,y), then the conditional density function of Y given X = x is defined by

\[ f(y|x) = \frac{f(x,y)}{f_X(x)} \quad \text{provided } f_X(x) > 0. \]

The definition for the discrete case is motivated by the fact that

\[ P(Y = y \mid X = x) = \frac{P(X = x,\ Y = y)}{P(X = x)} \quad \text{provided that } p_X(x) > 0. \]

The definition for the continuous case is motivated by a consideration of the conditional distribution function of Y, given X = x, which can be defined as

\[ F(y|x) = \lim_{h \to 0} P(Y \le y \mid x - h < X < x + h). \]

If X and Y have a continuous joint density function f(x,y), then it can be shown that $F(y|x) = \int_{-\infty}^{y} f(x,y')\,dy' / f_X(x)$. (See H. Cramér, Mathematical Methods of Statistics, Princeton University Press, 1946, p. 268.) Taking the derivative with respect to y yields $f(y|x) = f(x,y)/f_X(x)$.

Note that, if X and Y are independent, then the conditional distribution of Y for any value of X is the same as the marginal distribution of Y.

Definition. The conditional distribution of Y, given X = x, is the distribution specified by the conditional distribution function F(y|x) defined above [or by p(y|x) or f(y|x) in the discrete or continuous cases]. The conditional expectation of Y given X = x, denoted by $E(Y \mid X = x)$, is defined to be the expectation of the conditional distribution. The conditional variance, denoted by $\mathrm{Var}(Y \mid X = x)$, is defined as the variance of the conditional distribution.

In particular, if X and Y have joint probability function p(x,y), then the conditional expectation of Y given X = x is given by

\[ E(Y \mid X = x) = \sum_y y\,p(y|x) \]

for those values of x for which $p_X(x) > 0$ and $\sum_y |y|\,p(y|x) < \infty$. If X and Y have joint density function f(x,y), then the conditional expectation of Y given X = x is given by

\[ E(Y \mid X = x) = \int_{-\infty}^{\infty} y\,f(y|x)\,dy \]

for those values of x for which $f_X(x) > 0$ and $\int |y|\,f(y|x)\,dy < \infty$.

If X and Y are independent, then the conditional distribution of Y given X = x is the same as the marginal distribution of Y, so that $E(Y \mid X = x) = E(Y)$ and $\mathrm{Var}(Y \mid X = x) = \mathrm{Var}(Y)$. If Y is some function of X, say $Y = g(X)$, then given that X = x, the conditional distribution of Y is entirely concentrated at the point g(x). Hence, in this case, $E(Y \mid X = x) = E(g(X) \mid X = x) = g(x)$, and $\mathrm{Var}(Y \mid X = x) = 0$.

Definition. Assume that $E(Y \mid X = x)$ exists for all x for which $p_X(x) > 0$ [or $f_X(x) > 0$ in the continuous case]. Then the conditional expectation of Y given X, denoted by $E(Y|X)$, is the random variable having the value $E(Y \mid X = x)$ when X = x.

In particular, if $Y = g(X)$, then $E(Y|X) = g(X)$. If Y and X are independent, then $E(Y|X) = E(Y)$. Other examples will be given below.

Although the definitions above are stated for the case that the conditioning variable X is a random variable, the definitions could just as well have been given for the more general case where X is a "random vector," i.e., $X = (X_1, \ldots, X_n)$ where the random variables $X_i$ are all defined on the same sample space.
Examples.

1. A fair coin is tossed four times. Let X be the number of heads on the first two tosses, and let Y be the number of heads on all four tosses. Then the joint probability function of X and Y is given on page 69. Given X = 1, the conditional probability function of Y is: $p(0|1) = 0$, $p(1|1) = 1/4$, $p(2|1) = 1/2$, $p(3|1) = 1/4$, and $p(4|1) = 0$. Thus, $E(Y \mid X = 1) = 1(1/4) + 2(1/2) + 3(1/4) = 2$. In this case, the random variable $E(Y|X)$ has value 1 when X = 0, 2 when X = 1, and 3 when X = 2, so that $E(Y|X) = 1 + X$.

2. As a generalization of Example 1, let Y be the number of successes in n + m Bernoulli trials with probability p of success on each trial, and let X be the number of successes on the first n trials. Then

\[ p(x,y) = \binom{n}{x} p^x q^{n-x} \binom{m}{y-x} p^{y-x} q^{m-(y-x)} = \binom{n}{x}\binom{m}{y-x} p^{y} q^{n+m-y} \]

for $x = 0, 1, \ldots, n$, $y = x, x+1, \ldots, x+m$. Since X has a binomial distribution with parameters n and p, $p_X(x) = \binom{n}{x} p^x q^{n-x}$, and it follows that the conditional probability function of Y for given x is

\[ p(y|x) = \binom{m}{y-x} p^{y-x} q^{m-(y-x)} \quad \text{for } y = x, x+1, \ldots, x+m. \]

Therefore the conditional distribution of Y is the same as the distribution of x + Z where Z has a binomial distribution with parameters m and p. It follows that $E(Y \mid X = x) = x + mp$ and $\mathrm{Var}(Y \mid X = x) = mpq$. Note that in this case the random variable $E(Y|X) = X + mp$ is the same as the best linear predictor of Y based on X. (See Example 2, page 80.) Theorem 7-6(a) below states the general result that $E[E(Y|X)] = E(Y)$. This can be verified directly in this case as follows:

\[ E[E(Y|X)] = E(X) + mp = np + mp = (n + m)p = E(Y). \]

3. If X and Y have a bivariate normal distribution, then the marginal distribution of X is $N(\xi, \sigma_x^2)$ by Theorem 7-5, and it follows from the representation of the bivariate normal density f(x,y) at the bottom of page VII-18 that the conditional density of Y for X = x is

\[ f(y|x) = \frac{f(x,y)}{f_X(x)} = \frac{1}{\sigma_y \sqrt{1 - \rho^2}}\,\varphi\!\left(\frac{y - \alpha - \beta x}{\sigma_y \sqrt{1 - \rho^2}}\right), \]

where $\beta = \rho\,\sigma_y/\sigma_x$, $\alpha = \eta - \beta\xi$, and $\varphi$ is the standard normal density function. Thus, given X = x, the conditional distribution of Y is $N(\alpha + \beta x,\ \sigma_y^2(1 - \rho^2))$, and the conditional expectation of Y given X is $E(Y|X) = \alpha + \beta X$, which again coincides with the best linear predictor of Y based on X. In this case, the conditional variance of Y given X = x is $\sigma_y^2(1 - \rho^2)$ for all values of x.

Theorem 7-6. Let X and Y be random variables such that the expectations indicated below exist.

(a) $E[E(Y|X)] = E(Y)$.

(b) $E[g(X)|X] = g(X)$.

(c) If X and Y are independent, $E(Y|X) = E(Y)$.

(d) $E[g(X)h(Y)|X] = g(X)\,E[h(Y)|X]$.

(e) For any constants a and b, $E[aY + b|X] = a\,E(Y|X) + b$.

(f) If U and V are random variables having finite expectations, then $E(U + V|X) = E(U|X) + E(V|X)$.

Proof: (a) In the discrete case,

\[ E[E(Y|X)] = \sum_x \Big[\sum_y y\,p(y|x)\Big] p_X(x) = \sum_x \sum_y y\,p(x,y) = \sum_y y \sum_x p(x,y) = \sum_y y\,p_Y(y) = E(Y). \]

A similar proof can be given in the continuous case.

(b)-(f). Parts (b) and (c) were proved earlier. The proofs of the other parts are omitted.
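Theorem 7-6(a) can be verified by enumeration in a small discrete case. The sketch below is an illustration, not part of the original notes: it takes X to be the number of heads on the first two of four fair coin tosses and Y the number of heads on all four, and the helper name `conditional_expectation` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Joint probability function of (X, Y) from the 16 equally likely outcomes.
joint = {}
for tosses in product((0, 1), repeat=4):
    key = (sum(tosses[:2]), sum(tosses))
    joint[key] = joint.get(key, 0) + Fraction(1, 16)

def conditional_expectation(joint):
    """Return ({x: E(Y | X = x)}, {x: p_X(x)}) for a discrete joint table."""
    p_x, weighted = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        weighted[x] = weighted.get(x, 0) + y * p   # sum over y of y * p(x, y)
    return {x: weighted[x] / p_x[x] for x in p_x}, p_x

cond_mean, p_x = conditional_expectation(joint)
tower = sum(cond_mean[x] * p_x[x] for x in p_x)    # E[E(Y|X)]
mean_y = sum(y * p for (x, y), p in joint.items()) # E(Y)

print(cond_mean)
print(tower, mean_y)
```

The dictionary `cond_mean` tabulates the random variable E(Y|X) as a function of x, and the last line compares its expectation with E(Y).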

Theorem 7-4 derived the best linear predictor of Y based on X in the sense of mean squared prediction error and showed that

\[ E(Y - \alpha - \beta X)^2 = (1 - \rho^2)\,\sigma_Y^2. \]

Theorem 7-7. Among all functions $\delta(X)$, $E[Y - \delta(X)]^2$ is minimized by $\delta(X) = E(Y|X)$. The mean square prediction error is given by $E[Y - E(Y|X)]^2 = (1 - \eta^2)\,\sigma_Y^2$ where $\eta = \rho(Y,\ E(Y|X))$. [$\eta^2$ is called the correlation ratio of Y and X.]

Proof: Let $g(X) = E(Y|X) - \delta(X)$. Then

\[ [Y - \delta(X)]^2 = [Y - E(Y|X) + g(X)]^2 = [Y - E(Y|X)]^2 + 2g(X)[Y - E(Y|X)] + [g(X)]^2. \]

Since the next to the last term on the right has expectation zero by parts (a) and (d) of the previous theorem,

\[ E[Y - \delta(X)]^2 = E[Y - E(Y|X)]^2 + E[g(X)]^2 \ge E[Y - E(Y|X)]^2. \]

The fact that $E[Y - E(Y|X)]^2 = (1 - \eta^2)\,\sigma_Y^2$ follows immediately from Theorem 7-4(a) by observing that the best linear predictor of Y based on $E(Y|X)$ is $E(Y|X)$.

Since the mean squared prediction error using the best linear function of X is $E[Y - \alpha - \beta X]^2 = (1 - \rho^2)\,\sigma_Y^2$ where $\rho$ is the correlation coefficient of X and Y, it follows that $1 - \rho^2 \ge 1 - \eta^2$, implying that $\rho^2 \le \eta^2$. If it happens that $E(Y|X)$ is linear in X, then $\rho^2 = \eta^2$ and $E(Y|X) = \alpha + \beta X$ where $\beta = \rho\,\sigma_Y/\sigma_X$ and $\alpha = \mu_Y - \beta\mu_X$. In particular, if X and Y have a bivariate normal distribution, then it was shown on page 89 that $E(Y|X) = \alpha + \beta X$ where $\beta = \rho\,\sigma_y/\sigma_x$. This proves a result that was stated but not proved earlier, namely, that the parameter $\rho$ in the bivariate normal density function is the correlation coefficient of X and Y.
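The inequality $\rho^2 \le \eta^2$ can be strict. As an illustration (not part of the original notes), the sketch below uses the joint probability function $p(0,0) = p(-1,1) = p(1,1) = 1/3$, for which $Y = X^2$; it computes the mean squared error of the best linear predictor and of $E(Y|X)$ and converts each into the corresponding squared correlation. The variable names are hypothetical.

```python
from fractions import Fraction

third = Fraction(1, 3)
joint = {(0, 0): third, (-1, 1): third, (1, 1): third}  # here Y = X^2

ex = sum(p * x for (x, y), p in joint.items())
ey = sum(p * y for (x, y), p in joint.items())
var_x = sum(p * (x - ex) ** 2 for (x, y), p in joint.items())
var_y = sum(p * (y - ey) ** 2 for (x, y), p in joint.items())
cov_xy = sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())

# Best linear predictor: minimum error is Var(Y) - Cov^2 / Var(X) = (1 - rho^2) Var(Y).
linear_mse = var_y - cov_xy ** 2 / var_x

# Conditional expectation E(Y | X = x) and its mean squared prediction error.
p_x, weighted = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    weighted[x] = weighted.get(x, 0) + y * p
cond_mean = {x: weighted[x] / p_x[x] for x in p_x}
cond_mse = sum(p * (y - cond_mean[x]) ** 2 for (x, y), p in joint.items())

rho_sq = 1 - linear_mse / var_y   # squared correlation coefficient
eta_sq = 1 - cond_mse / var_y     # correlation ratio
print(rho_sq, eta_sq)
```

In this case the linear predictor gains nothing from X, while the conditional expectation predicts Y exactly.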
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The  following  example  Illustrates  how  the  above  theory  Is  sometiaes 

applied  to  estimate  parameters  of  distributions  In  those  Instances  where 

the  parameters  themselves  can  be  considered  to  be  random  variables. 

Example.  You  can  observe  a sequence  of  n Bernoulli  trials  with 

probability  y of  success  on  each  trial.  Suppose  that  the  value  of  y 

Is  unknown,  and  you  want  to  guess  y based  on  the  number  of  successes, 

X,  In  n trials.  In  the  absence  of  any  Information  on  the  values  of  y, 

you  might  guess  y using  the  "estimator"  X/n.  This  estimator  has  expec- 

2 2 

tatlon  E(X/n)  ■ y and  variance  Var(X/n)  *•  Var(X)/n  - ny(l-y)/n  «■  y(l-y)/n. 

Now  suppose  that  you  are  Informed  (or  are  willing  to  assume)  that  the 
value  of  y was  randomly  generated  according  to  a distribution  having 
density  function  f(y)  on  (0,1).  That  Is,  y can  be  regarded  as  the 
value  of  a random  variable  Y having  density  function  f(y),  and  the  con- 
ditional probability  function  of  X given  Y - y Is 

p(x|y)  - (")  y*  (l-y)"“*  for  x - 0,1,..., n. 

Guessing  the  value  of  Y based  on  X amounts  to  "predicting"  Y using 
some  function  of  X.  If  the  mean  squared  prediction  error  Is  an  appropriate 
goodness  criterion  for  your  estimator,  then  the  theory  above  suggests  using 
E(Y|X)  to  estimate  y.  Here,  X Is  discrete  and  Y Is  continuous,  so  that 
the  Joint  distribution  of  X and  Y Is  neither  discrete  nor  continuous. 
However,  using  the  fact  that 

P(X  « A,  a < Y < b).  - L p(x|y)  f(y)  dy, 

xeA 

one  can  show  that  the  conditional  density  of  Y given  X ■ x Is 
, p(xlv)f (v)  _ t(y) 

PxW  /J  f(y)  d,  ■ 

Thus, given the density function f(y), one can compute the conditional
expectation of Y for any value of X. In particular, if f(y) = 1 for
0 < y < 1, then

$$E(Y|X = x) = \int_0^1 y\, f(y|x)\, dy = \frac{\int_0^1 y^{x+1} (1-y)^{n-x}\, dy}{\int_0^1 y^{x} (1-y)^{n-x}\, dy}.$$

Since $\int_0^1 y^{\alpha} (1-y)^{\beta}\, dy = \alpha!\, \beta! / (\alpha + \beta + 1)!$ for $\alpha = 0, 1, \ldots$ and $\beta = 0, 1, \ldots$
(see Alexander M. Mood and Franklin A. Graybill, Introduction to the Theory
of Statistics, Second Edition, McGraw-Hill, New York, 1963, pp. 129-131),
it follows that

$$E(Y|X = x) = \frac{(x+1)!\,(n-x)!}{(n+2)!} \cdot \frac{(n+1)!}{x!\,(n-x)!} = \frac{x+1}{n+2}.$$

Thus, if Y is chosen according to a uniform distribution on (0,1), then
the estimator E(Y|X) = (X+1)/(n+2) has the smallest mean squared
prediction error among all functions of X.
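The result (x+1)/(n+2) can be checked with exact rational arithmetic. The sketch below is only an illustration, not part of the original notes: the function names `beta_integral` and `posterior_mean` are hypothetical, and the code assumes the uniform density f(y) = 1 used in the example.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a: int, b: int) -> Fraction:
    """Integral of y^a (1-y)^b over (0,1) for nonnegative integers a and b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def posterior_mean(x: int, n: int) -> Fraction:
    """E(Y | X = x) when Y is uniform on (0,1) and X given Y = y is Binomial(n, y)."""
    # Ratio of the two integrals in the text; the binomial coefficient cancels.
    return beta_integral(x + 1, n - x) / beta_integral(x, n - x)


# Compare with the closed form (x+1)/(n+2) for every possible x when n = 10.
n = 10
for x in range(n + 1):
    assert posterior_mean(x, n) == Fraction(x + 1, n + 2)

print(posterior_mean(3, 10))  # 1/3
```

Note that the estimate never equals 0 or 1, even when x = 0 or x = n, in contrast to the estimator X/n.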

Exercises.

1. Suppose X and Y have joint density function f(x,y) = 2 for
0 < y < x < 1. (See Example 3, page 80.) Show that (a) given X = x
the conditional distribution of Y is a uniform distribution on (0,x),
(b) E(Y|X) = X/2, and (c) E(X|Y) = (1+Y)/2. Verify directly that
E[E(X|Y)] = E(X).

2. Let X and Y have the joint probability function given at the top
of page 72. Verify that (a) E(Y|X = 0) = 3/2, (b) E(Y|X) = (3-X)/2.

3. Suppose X has a uniform distribution on (-1,1), and $Y = X^2$. Show
that (a) X and Y are uncorrelated, (b) the regression line of Y on X
is y = 1/3, (c) the correlation ratio between X and Y is 1.

4. If the conditional variance of Y given X is defined by

$$\operatorname{Var}(Y|X) = E\big([Y - E(Y|X)]^2 \,\big|\, X\big),$$

show that

(a) $\operatorname{Var}(Y|X) = E(Y^2|X) - [E(Y|X)]^2$,

(b) $\operatorname{Var}(Y) = E[\operatorname{Var}(Y|X)] + \operatorname{Var}[E(Y|X)]$.

5. Show that if X and Y have joint density $f(x,y) = e^{-y}$ for
$0 < x < y < \infty$, then E(Y|X) = X + 1, E(X|Y) = Y/2, and $\rho(X,Y) =$

SECTION VIII - SOME SAMPLING THEORY


Reference:

Paul L. Meyer, Introductory Probability and Statistical Applications,
2nd Edition, Addison-Wesley, 1960, Chapters 12 and 13.

By a random sample of size n from a population having distribution
function F is meant a sequence of n i.i.d. (independent and identically
distributed) random variables $X_1, X_2, \ldots, X_n$, each having distribution
function F. Such a sample might result from choosing an element at random
from some population, observing the value $X_1$ of some characteristic of
the element, replacing the element, choosing a second element, observing
the value $X_2$ of the same characteristic of the second element, and so
forth. Alternatively, the sequence $X_1, X_2, \ldots, X_n$ may result from
observing n independent trials of the same type. For example, $X_i$ might
be the sum of the results on the i-th trial when two dice are tossed repeatedly.
Or $X_1, X_2, \ldots, X_n$ might be the waiting times between n successive
telephone calls coming into an exchange.

In most statistical applications the distribution of the $X_i$'s is
unknown and one attempts to make inferences about the distribution based upon
the values of the observations $X_1, X_2, \ldots, X_n$. For example, one might want
to estimate the mean or standard deviation of the distribution from which the
$X_i$'s are drawn. Quite often the statistics commonly used in drawing such
inferences involve sums or averages of the $X_i$'s or functions of the $X_i$'s.

Theorem 8-1. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables, each
having mean $\mu$ and finite variance $\sigma^2$.

(a) If $\bar X = (X_1 + X_2 + \cdots + X_n)/n$, then $E(\bar X) = \mu$ and $\operatorname{Var}(\bar X) = \sigma^2/n$.

(b) (Law of Large Numbers) For any $\epsilon > 0$, $P(|\bar X - \mu| \ge \epsilon)$ tends
to 0 as n becomes infinite.

Proof: (a) $E(\bar X) = (1/n) \sum E(X_i) = n\mu/n = \mu$.

$\operatorname{Var}(\bar X) = (1/n^2) \sum \operatorname{Var}(X_i) = n\sigma^2/n^2 = \sigma^2/n$.

(b) By Chebyshev's Inequality (see page 50),

$$P(|\bar X - \mu| \ge \epsilon) \le \operatorname{Var}(\bar X)/\epsilon^2 = \sigma^2/(n\epsilon^2).$$

The last member on the right tends to 0 as $n \to \infty$.
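To see how the Chebyshev bound in part (b) compares with the actual probability, the following sketch computes $P(|\bar X - \mu| \ge \epsilon)$ exactly for the average of n fair dice. The choice of a fair die (mean 7/2, variance 35/12) and the function names are assumptions made for illustration only.

```python
from fractions import Fraction


def mean_tail_probability(n: int, eps: Fraction) -> Fraction:
    """Exact P(|Xbar - 7/2| >= eps) for the average of n fair dice."""
    dist = {0: Fraction(1)}  # distribution of the running sum S_k
    for _ in range(n):
        new = {}
        for s, p in dist.items():
            for face in range(1, 7):
                new[s + face] = new.get(s + face, Fraction(0)) + p / 6
        dist = new
    mu = Fraction(7, 2)
    return sum(p for s, p in dist.items() if abs(Fraction(s, n) - mu) >= eps)


def chebyshev_bound(n: int, eps: Fraction) -> Fraction:
    """Upper bound sigma^2 / (n eps^2) from Theorem 8-1(b)."""
    sigma2 = Fraction(35, 12)  # variance of a single fair die
    return sigma2 / (n * eps * eps)


eps = Fraction(1, 2)
for n in (5, 20, 80):
    print(n, float(mean_tail_probability(n, eps)), float(chebyshev_bound(n, eps)))
```

Both columns decrease as n grows, and the exact probability stays below the bound, which is typically far from tight.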

Since $\operatorname{Var}(\bar X) = \sigma^2/n$, as n increases the distribution of $\bar X$
becomes more and more concentrated about $E(\bar X) = \mu$. If, instead of
considering the average of $X_1, X_2, \ldots, X_n$, one considers the sum
$S_n = X_1 + X_2 + \cdots + X_n$, then $E(S_n) = n\mu$ and $\operatorname{Var}(S_n) = n\sigma^2$, so that
if $\sigma^2 > 0$ the distribution of $S_n$ becomes increasingly spread out as n increases.

Consider the "standardized" variable $(S_n - n\mu)/(\sigma\sqrt{n})$. This random
variable has mean 0 and variance 1 for all values of n. The theorem
below states that, no matter what the initial distribution of the $X_i$'s is,
the distribution function of $(S_n - n\mu)/(\sigma\sqrt{n})$ tends to the distribution
function of a standard normal distribution.

Theorem 8-2. (Central Limit Theorem.) Let $X_1, X_2, \ldots$ be i.i.d.
random variables with mean $\mu$ and finite variance $\sigma^2 > 0$, and let
$S_n = X_1 + X_2 + \cdots + X_n$. For any constants a and b with $-\infty \le a < b \le \infty$,

$$\lim_{n \to \infty} P\left(a < \frac{S_n - n\mu}{\sigma\sqrt{n}} < b\right) = \Phi(b) - \Phi(a),$$

where $\Phi$ is the distribution function of a standard normal distribution.

Proof: See Meyer, op. cit., pp. 252-253.




This is a generalization of the DeMoivre-Laplace Central Limit Theorem
stated earlier in Section VI for the case where the $X_i$'s have Bernoulli
distributions. The theorem suggests that for "large" values of n one can
approximate probabilities of the form $P(S_n \le k)$ as follows:

$$P(S_n \le k) = P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le \frac{k - n\mu}{\sigma\sqrt{n}}\right) \approx \Phi\left(\frac{k - n\mu}{\sigma\sqrt{n}}\right).$$


Depending on the distribution of the $X_i$'s, this "normal approximation" for
sums of i.i.d. random variables is usually quite good even for relatively
small values of n (say, n = 25 if the $X_i$'s have Bernoulli distributions
with p close to 1/2, and n = 10 if the $X_i$'s have uniform or exponential
distributions). If the $X_i$'s have normal distributions, the approximation
is exact because in this case it can be shown that $S_n$ also has a normal
distribution. (See Theorem 8-6 below.)
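As a rough check of the claim for Bernoulli trials with n = 25 and p = 1/2, the sketch below compares the exact binomial probabilities $P(S_n \le k)$ with the normal approximation. The helper names are hypothetical, and the value k + 0.5 used in the approximation is a continuity correction, which is an added assumption not stated in the passage above.

```python
from math import comb, erf, sqrt


def phi(z: float) -> float:
    """Distribution function of a standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_approx_sum(k: float, n: int, mu: float, sigma: float) -> float:
    """Approximate P(S_n <= k) by Phi((k - n mu) / (sigma sqrt(n)))."""
    return phi((k - n * mu) / (sigma * sqrt(n)))


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(S_n <= k) when S_n is a sum of n Bernoulli(p) variables."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


n, p = 25, 0.5
sigma = sqrt(p * (1 - p))  # standard deviation of one Bernoulli trial
for k in (8, 10, 12, 15):
    exact = binomial_cdf(k, n, p)
    approx = normal_approx_sum(k + 0.5, n, p, sigma)  # continuity correction (assumption)
    print(k, round(exact, 4), round(approx, 4))
```

For these values of k the two columns agree to about two decimal places.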

Note that, since

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{(S_n/n) - \mu}{\sigma/\sqrt{n}} = \frac{\bar X - \mu}{\sigma/\sqrt{n}},$$

the Central Limit Theorem could just as well have stated that the average
of n i.i.d. random variables has a limiting normal distribution as n
becomes infinite.

Example. Suppose that light bulbs have lifetimes in hours that can be
assumed to have a distribution with mean 1000 and standard deviation 500.
Find the probability that the average lifetime of 100 such light bulbs will
be greater than 1100 hours.

Let $X_1, X_2, \ldots$ be the lifetimes in hours, and let $\bar X = \sum X_i/100$.
Assuming that the $X_i$'s are a random sample from a distribution having mean
1000 and standard deviation 500, it follows that $E(\bar X) = 1000$ and
$\sigma_{\bar X} = 500/\sqrt{100} = 50$. Hence,

$$P(\bar X > 1100) = 1 - P\left(\frac{\bar X - 1000}{50} \le \frac{1100 - 1000}{50}\right) \approx 1 - \Phi(2.0) = 0.023.$$
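The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch, assuming only the normal approximation used above; the function name `prob_average_exceeds` is hypothetical.

```python
from math import erf, sqrt


def phi(z: float) -> float:
    """Distribution function of a standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prob_average_exceeds(threshold: float, mean: float, sd: float, n: int) -> float:
    """Normal approximation to P(Xbar > threshold) for an average of n i.i.d. variables."""
    sd_of_average = sd / sqrt(n)  # standard deviation of Xbar
    z = (threshold - mean) / sd_of_average
    return 1.0 - phi(z)


prob = prob_average_exceeds(1100.0, 1000.0, 500.0, 100)
print(round(prob, 3))  # 0.023
```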


Exercises.

1. Suppose that 10 storage batteries $B_1, B_2, \ldots, B_{10}$ are used in
the following way. First, $B_1$ is used until it fails, at which time it
is replaced by $B_2$. Then, when $B_2$ fails, it is replaced by $B_3$, etc.
If these batteries are chosen at random from a population having mean
lifetime 12 hours and variance 2.5 (in squared hours), what is the approximate
probability that the total time of operation of the batteries will exceed 110
hours? Ans. 0.98.

2. Suppose 100 random digits are generated. That is, 100 independent
trials are conducted in which one of the digits 0, 1, 2, ..., 9 is chosen at
random. Approximate the probability that (a) the digit 0 occurs more
than 15 times among the 100 random digits, (b) the sum of the 100 digits
exceeds 500, (c) the average of the 100 digits lies between 4.0 and 5.0.
Ans. 0.03, 0.04, 0.92.

3. Suppose that, when the heights of 300 plants are measured to the
nearest inch, the rounding errors are independent and uniformly distributed
over (-0.5, 0.5). If the 300 heights are averaged after rounding, what
is the probability that the magnitude of the total error due to rounding
exceeds 0.02? Ans. 0.23.

4. In pari-mutuel wagering, the racetrack (or gambling house) takes
a fixed percentage of the total amount bet and returns the rest to those
who have bet on the winning horse. For example, suppose that the total
amount bet on a certain race is $6000, of which $2000 is bet on Horse #1,
including your $2 bet. If the track "take" is $1000 and Horse #1 wins, then the
remaining $5000 is divided up among the holders of winning tickets on Horse #1. Thus
the "betting odds" on Horse #1 are said to be "5-to-2", i.e., a $2 bet will
yield a return of $5 for a net gain of $3. Perhaps a reasonable assessment
of the probability that Horse #1 will win is the proportion of the total
amount of the money that is bet on Horse #1, which is 1/3 in this case.
Suppose you repeatedly play a game like this, so that on each play you either
win $3 with probability 1/3 or lose $2 with probability 2/3. (a) Find the
expectation and variance of your "net gain" after 18 plays of the game, and
(b) find the approximate probability that you will be ahead after 18 plays.
Ans. -6, 100, 0.23.

Theorems 8-1 and 8-2 above are usually applied in situations where the
random variables $X_1, X_2, \ldots, X_n$ are the values of the observations
themselves. However, the theorems apply equally well to transformations of the
observations in the following sense.

Theorem 8-3. Let $Y_i = g(X_i)$ where $X_1, X_2, \ldots, X_n$ are i.i.d. random
variables, and let $\bar Y = \sum Y_i/n$ and $T_n = \sum Y_i$. If $Y_i$ has mean $\eta = E[g(X)]$
and variance $\tau^2 < \infty$, then

(a) $E(\bar Y) = \eta$ and $\operatorname{Var}(\bar Y) = \tau^2/n$,

(b) for any $\epsilon > 0$, $P(|\bar Y - \eta| \ge \epsilon) \to 0$ as $n \to \infty$,

(c) for any constants a and b with $-\infty \le a < b \le \infty$,

$$\lim_{n \to \infty} P\left(a < \frac{T_n - n\eta}{\tau\sqrt{n}} < b\right) = \Phi(b) - \Phi(a).$$

Proof: Since the random variables $Y_1, Y_2, \ldots, Y_n$ are i.i.d. [the
independence of the $Y_i$'s follows from the obvious generalization of Theorem 7-2(a)],
these results follow immediately from Theorems 8-1 and 8-2.

The above theorems are of fundamental importance in statistics. Given
a random sample $X_1, X_2, \ldots, X_n$ from a distribution having unknown mean
$\mu$ and variance $\sigma^2 > 0$, one can estimate $\mu$ using the value of the estimator
$\bar X$. That is, if the observed values of $X_1, X_2, \ldots, X_n$ are $x_1, x_2, \ldots, x_n$,
then the estimated value of $\mu$ for that particular sample is $\bar x = \sum x_i/n$,
the observed value of $\bar X$. The goodness of an estimator is usually measured
by the extent to which the distribution of the estimator is concentrated
around the parameter being estimated. As we shall see later, there may be
other estimators that are better than $\bar X$ in particular instances, but $\bar X$
has certain appealing properties. By Theorem 8-1, the distribution of $\bar X$
is "centered" at $\mu$ in the sense that $E(\bar X) = \mu$ for all values of $\mu$. Also
the variance of $\bar X$ is only 1/n times as large as the variance of each of
the original $X_i$'s. If n is large enough, $\bar X$ is approximately $N(\mu, \sigma^2/n)$
so that, if $\sigma^2$ is known, one can use the normal approximation to approximate
the probability that $\bar X$ will deviate from $\mu$ by more than any prespecified
amount $\epsilon > 0$:

$$P(|\bar X - \mu| > \epsilon) = P\left(\frac{|\bar X - \mu|}{\sigma/\sqrt{n}} > \frac{\epsilon}{\sigma/\sqrt{n}}\right) \approx 2\left[1 - \Phi\left(\frac{\epsilon}{\sigma/\sqrt{n}}\right)\right].$$

If $\sigma^2$ is unknown, one can derive an estimator of $\sigma^2$ using the fact
that $\sigma^2 = E(X^2) - \mu^2$ where X has the same distribution as the $X_i$'s. By
Theorem 8-3, $\sum X_i^2/n$ has expectation $E(X^2)$. Therefore, one can estimate
$\sigma^2$ using

$$\left(\sum X_i^2/n\right) - \hat\mu^2,$$

where $\hat\mu$ is some estimator of $\mu$. If one uses $\hat\mu = \bar X$, the resulting
estimator is the "sample variance"

$$S^2 = \left(\sum X_i^2/n\right) - \bar X^2 = \sum (X_i - \bar X)^2/n.$$

However, $S^2$ is a "biased" estimator of $\sigma^2$ [i.e., $E(S^2) \ne \sigma^2$] because
$E(\bar X^2) = \operatorname{Var}(\bar X) + E^2(\bar X) = (\sigma^2/n) + \mu^2$, implying that

$$E(S^2) = E(X^2) - (\sigma^2/n) - \mu^2 = \sigma^2 - \sigma^2/n = \sigma^2 (n-1)/n.$$

To obtain an unbiased estimator, we can multiply $S^2$ by n/(n-1), yielding
the following alternative estimator:

$$\hat\sigma^2 = \frac{\sum (X_i - \bar X)^2}{n-1} = \frac{\sum X_i^2 - n\bar X^2}{n-1}.$$

Theorem 8-4. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having mean
$\mu$ and variance $\sigma^2$.

(a) The sample variance $S^2 = \sum (X_i - \bar X)^2/n$ has expectation
$E(S^2) = \sigma^2 (n-1)/n$.

(b) An unbiased estimator of $\sigma^2$ is $\hat\sigma^2 = \sum (X_i - \bar X)^2/(n-1)$.

(c) An unbiased estimator of $\operatorname{Var}(\bar X)$ is $\hat\sigma^2/n$.
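Parts (a) and (b) can be verified exactly for a small discrete population by enumerating every possible sample. The sketch below is illustrative only: the population values 0, 1, 4 (each equally likely) and the sample size n = 3 are hypothetical choices, not taken from the notes.

```python
from fractions import Fraction
from itertools import product

values = [0, 1, 4]  # hypothetical population; each value has probability 1/3
n = 3               # sample size

mu = Fraction(sum(values), len(values))
sigma2 = sum((Fraction(v) - mu) ** 2 for v in values) / len(values)


def sample_variance(sample) -> Fraction:
    """S^2 = sum (x_i - xbar)^2 / n, the biased sample variance."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / len(sample)


# All 3^n ordered samples are equally likely under i.i.d. sampling.
samples = list(product(values, repeat=n))
expected_s2 = sum(sample_variance(s) for s in samples) / len(samples)
expected_unbiased = expected_s2 * n / (n - 1)

print(sigma2, expected_s2, expected_unbiased)
```

The second printed value equals $\sigma^2 (n-1)/n$ and the third equals $\sigma^2$, as the theorem asserts.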

Part (c) above enables us to attach a measure of reliability to $\bar X$ as
an estimator of $\mu$ even if $\sigma^2$ is unknown. If $\sigma^2$ is known, the standard
deviation of $\bar X$ is $\sigma/\sqrt{n}$. If $\sigma^2$ is unknown, the variance of $\bar X$ can be
estimated by $\hat\sigma^2/n$ (or $S^2/n$). The square root of the estimated variance (called
the standard error of $\bar X$) is an estimate of the standard deviation of $\bar X$.

The reader should not infer from the above that the estimators $\bar X$ and
$\hat\sigma^2$ are necessarily good estimators in all circumstances. Nor is it the
case that the unbiased estimator $\hat\sigma^2$ is necessarily preferable to the biased
estimator $S^2$. In the next section, examples will be given to indicate that
both $\bar X$ and $\hat\sigma^2$ can often be improved upon, depending on the nature of the
distribution from which the random sample is taken. Also, $S^2$ is a better
estimator than $\hat\sigma^2$ according to a certain goodness criterion that will be
introduced later.

Definition. Let $X_1, X_2, \ldots, X_n$ be a random sample from a population
having distribution function F. The order statistics corresponding to the
random sample are defined by

$X_{(1)} = \min(X_1, X_2, \ldots, X_n)$,
$X_{(2)}$ = the second smallest of the $X_i$'s,
...
$X_{(n)} = \max(X_1, X_2, \ldots, X_n)$.

The sample range is defined by $R = X_{(n)} - X_{(1)}$, and the sample median by
$X_{((n+1)/2)}$ if n is odd and $[X_{(n/2)} + X_{(n/2+1)}]/2$ if n is even.

The sample (empirical) distribution function of $X_1, X_2, \ldots, X_n$ is defined
for all x by

$$F_n(x) = \frac{\text{number of } X_i\text{'s having value} \le x}{n}.$$

For given values of the $X_i$'s, the sample c.d.f. (cumulative distribution
function) is a step function having jumps of size 1/n at the values
$X_{(1)}, X_{(2)}, \ldots, X_{(n)}$. The figure below depicts the case n = 4.


[Figure: the sample c.d.f. $F_n(x)$ for the case n = 4.]

For given values of the $X_i$'s, the sample c.d.f. is the distribution function
of a discrete distribution that assigns probability 1/n to each observed value,
and it varies from sample to sample. Let $X^*$ be a random variable having this
(conditional) distribution function. Then the conditional expectation of $X^*$
given the sample random variables $X_1, X_2, \ldots, X_n$ is $\sum X_i/n = \bar X$, and the
conditional variance of $X^*$ is the sample variance $S^2 = \sum (X_i - \bar X)^2/n$.

Just as the sample mean $\bar X$ and the sample variance $S^2$ can be considered
as estimators of the population mean $\mu$ and variance $\sigma^2$, the
sample c.d.f. $F_n$ can be considered as an estimator of the population
distribution function. The following theorem shows that $F_n$ is an unbiased
estimator of F, and as n becomes infinite, $F_n$ tends to F for all values
of x.


Theorem 8-5. Let $F_n$ be the empirical distribution function of a
random sample $X_1, X_2, \ldots, X_n$ from a population having distribution function
F. Then

(a) $E[F_n(x)] = F(x)$ for all x,

(b) $\operatorname{Var}[F_n(x)] = F(x)[1 - F(x)]/n$,

(c) $P(|F_n(x) - F(x)| \ge \epsilon) \to 0$ as $n \to \infty$ for all values of x and any $\epsilon > 0$.

Proof: Note that $F_n(x) = \sum Y_i/n = \bar Y$, where $Y_i = 1$ or 0 according to
whether $X_i \le x$ or $X_i > x$. Here $Y_i$ has a Bernoulli distribution with
parameter $p = P(X_i \le x) = F(x)$. By Theorem 8-3, $E[F_n(x)] = E(\bar Y) = F(x)$ and
$\operatorname{Var}[F_n(x)] = F(x)[1 - F(x)]/n$, which tends to zero as $n \to \infty$.
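The definition of the sample c.d.f. translates directly into code. The following is a minimal sketch; the function name `empirical_cdf` and the four observations are hypothetical, chosen to mirror the case n = 4.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return F_n as a function: F_n(x) = (number of observations <= x) / n."""
    ordered = sorted(sample)  # the order statistics X_(1) <= ... <= X_(n)
    n = len(ordered)

    def F_n(x):
        # bisect_right counts how many ordered values are <= x
        return bisect_right(ordered, x) / n

    return F_n


sample = [2.3, 0.7, 1.5, 3.1]  # hypothetical observations, n = 4
F4 = empirical_cdf(sample)
print([F4(x) for x in (0.0, 0.7, 1.0, 2.3, 5.0)])  # [0.0, 0.25, 0.25, 0.75, 1.0]
```

The returned function is a step function that jumps by 1/4 at each of the four ordered observations.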

Exercises.

1. A random sample of size 25 is taken with the result that $\sum x_i = 50$
and $\sum x_i^2 = 200$. Compute the values of (a) $\bar x$, (b) $S^2$, (c) $\hat\sigma^2$, (d) $\hat\sigma/\sqrt{n}$.
Ans. (a) 2, (b) 4, (c) 25/6, (d) 0.41.

2. Show that if $X_1, X_2, \ldots, X_n$ are independent Bernoulli random
variables with parameter p, then the formula for $S^2$ in this case
reduces to $S^2 = \bar X(1 - \bar X)$, and the standard error of $\bar X$ is $\sqrt{\bar X(1 - \bar X)/(n-1)}$.

3. Let $X_1, X_2, \ldots, X_{100}$ be a random sample of 100 IQ scores from a
normal distribution having unknown mean $\mu$ but a known standard deviation
$\sigma = 16$. In this case $\bar X$ has a normal distribution by Theorem 8-6 below.
(a) Compute $P(|\bar X - \mu| \le 2)$. (b) Suppose you can choose a larger
sample size to increase the reliability of $\bar X$. How large a sample would
you need to assure that $P(|\bar X - \mu| \le 2) \ge 0.95$? Ans. (a) 0.79, (b) 246.

4. Let $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$ be two independent random
samples such that $E(X_i) = \xi$, $\operatorname{Var}(X_i) = \sigma^2$, $E(Y_j) = \eta$, and $\operatorname{Var}(Y_j) = \tau^2$.
For example, the $Y_j$'s might be the responses of
individuals who had received a special treatment of some kind, and the
$X_i$'s are the corresponding responses for individuals in the control group.
Let $\delta = \eta - \xi$ be the average effect of the treatment.

(a) Show that an unbiased estimator of the treatment effect is
$\hat\delta = \bar Y - \bar X$, and its variance is $\operatorname{Var}(\hat\delta) = (\tau^2/n) + (\sigma^2/m)$.

(b) Show that an unbiased estimator of $\operatorname{Var}(\hat\delta)$ is $(\hat\tau^2/n) + (\hat\sigma^2/m)$,
where $\hat\sigma^2 = \sum (X_i - \bar X)^2/(m-1)$ and $\hat\tau^2 = \sum (Y_j - \bar Y)^2/(n-1)$.

(c) If $\sigma^2 = \tau^2$, show that an unbiased estimator of $\sigma^2$ is the
"pooled" estimator

$$\hat\sigma^2 = \frac{\sum (X_i - \bar X)^2 + \sum (Y_j - \bar Y)^2}{m + n - 2}.$$

Let $X_1, X_2, \ldots, X_n$ be n random variables defined on the same sample
space. For certain statistical applications, it is necessary to derive the
exact distributions of certain functions of the $X_i$'s, such as $\bar X$, $\sum a_i X_i$,
$\max(X_1, X_2, \ldots, X_n)$, etc. There are certain standard techniques for deriving
such distributions that are treated in most statistics texts. (See, for
example, Robert V. Hogg and Allen T. Craig, Introduction to Mathematical
Statistics, Second Edition, The Macmillan Company, New York, Chapter 4.) You
have already used one general technique several times in deriving density
functions of transformed variables by first finding their distribution functions
and then taking derivatives. With only a few exceptions below, we shall not
need the other standard techniques for the distribution theory in this course,
and appropriate references will be cited when results are given without proof.

The following theorem states some results about the exact distributions
of sums of random variables. Many of these results are somewhat obvious from
the discussion of the models in Section VI. For notational convenience below
we shall use abbreviations such as "X ~ Binomial(n,p)" to denote "X has a
binomial distribution with parameters n and p."

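Theorem 8-6. In each of the following, assume that $X_1, X_2, \ldots, X_n$
are independent random variables.

(a) If $X_i \sim \text{Binomial}(n_i, p)$, then $\sum X_i \sim \text{Binomial}(\sum n_i, p)$.

(b) If $X_i \sim \text{Poisson}(\lambda_i)$, then $\sum X_i \sim \text{Poisson}(\sum \lambda_i)$.

(c) If $X_i \sim \text{Geometric}(p)$, then $\sum X_i \sim \text{Negative Binomial}(n, p)$.

(d) If $X_i \sim \text{Negative Binomial}(r_i, p)$, then $\sum X_i \sim \text{Negative Binomial}(\sum r_i, p)$.

(e) If $X_i \sim N(\mu_i, \sigma_i^2)$, then $\sum a_i X_i \sim N(\sum a_i \mu_i, \sum a_i^2 \sigma_i^2)$.

(f) If $X_i \sim \text{Gamma}(r_i, \lambda)$, then $\sum X_i \sim \text{Gamma}(\sum r_i, \lambda)$.

(g) If $X_i \sim \text{Cauchy}(\mu_i, \lambda_i)$, then $\sum a_i X_i \sim \text{Cauchy}(\sum a_i \mu_i, \sum |a_i| \lambda_i)$.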

Proof:

(a) Consider a sequence of Bernoulli trials with probability p of
success on each trial. Let $X_1$ be the number of successes on the first $n_1$
trials, $X_2$ the number of successes on the next $n_2$ trials, and so forth.
Then $X_1, X_2, \ldots, X_n$ are independent and $X_i \sim \text{Binomial}(n_i, p)$. Since $\sum X_i$
is the total number of successes on all $\sum n_i$ trials, $\sum X_i$ has a binomial
distribution with parameters $\sum n_i$ and p.

(b) Consider $Y = X_1 + X_2$ where $X_1$ and $X_2$ are independent,
$X_1 \sim \text{Poisson}(\lambda)$, $X_2 \sim \text{Poisson}(\mu)$. It suffices to show that $Y \sim \text{Poisson}(\lambda + \mu)$,
since the result for the sum of n random variables then follows by
mathematical induction. For y = 0, 1, 2, ...,

$$P(Y = y) = \sum_{x=0}^{y} \frac{e^{-\lambda} \lambda^x}{x!} \cdot \frac{e^{-\mu} \mu^{y-x}}{(y-x)!} = \frac{e^{-(\lambda+\mu)}}{y!} \sum_{x=0}^{y} \binom{y}{x} \lambda^x \mu^{y-x} = \frac{e^{-(\lambda+\mu)} (\lambda+\mu)^y}{y!}.$$
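The convolution sum in part (b) can also be checked numerically. The sketch below is only an illustration; the parameter values 1.5 and 2.5 and the function names are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def convolved_pmf(y: int, lam: float, mu: float) -> float:
    """P(X1 + X2 = y) for independent X1 ~ Poisson(lam), X2 ~ Poisson(mu)."""
    return sum(poisson_pmf(x, lam) * poisson_pmf(y - x, mu) for x in range(y + 1))


lam, mu = 1.5, 2.5  # hypothetical parameter values
for y in range(10):
    # The convolution should match the Poisson(lam + mu) probability function.
    assert abs(convolved_pmf(y, lam, mu) - poisson_pmf(y, lam + mu)) < 1e-12
print("convolution matches Poisson(lam + mu) for y = 0, ..., 9")
```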


(c)-(d) These can be proved as in (b) above.

(e)-(g) In general, if T = U + V where U and V are independent
random variables having density functions g(u) and h(v), then T has
density

$$f(t) = \int_{-\infty}^{\infty} g(u)\, h(t-u)\, du,$$

because the distribution function of T is

$$F(t) = P(T \le t) = P(U + V \le t) = \int_{-\infty}^{\infty} \int_{-\infty}^{t-u} g(u)\, h(v)\, dv\, du,$$

and the derivative of the double integral on the right is f(t). With this
simplification, the derivation of parts (e)-(g) is straightforward but tedious,
and the proofs are omitted. The proof of (e) for the case n = 2 is a
special case of Theorem 7-5(c), which states that linear functions of random
variables having a bivariate normal distribution have normal distributions.

Note that by part (g) of the theorem, if $X_1, X_2, \ldots, X_n$ are i.i.d. and
$X_i \sim \text{Cauchy}(\mu, \lambda)$, then $\bar X$ has exactly the same distribution as each of the
individual $X_i$'s. Hence, in this case, the distribution of $\bar X$ does not
become more and more concentrated as n increases, nor does the distribution
of $\bar X$ become increasingly normal as $n \to \infty$. Why is this not a
counterexample to Theorems 8-1 and 8-2?

Definition. A random variable X is said to have a chi-square ($\chi^2$)
distribution with n degrees of freedom [abbreviated $X \sim \chi^2(n)$] if X
has the same distribution as $\sum_{i=1}^{n} Z_i^2$, where $Z_1, \ldots, Z_n$ are
independent standard normal random variables.

Random variables having chi-square distributions occur frequently in
statistical applications. In particular, in sampling from a $N(\mu, \sigma^2)$
distribution, the estimators $S^2$ and $\hat\sigma^2$ introduced earlier in this section
are both multiples of chi-square distributed random variables. This
application will be discussed later in this section. The reason for calling the
parameter n the number of "degrees of freedom" will become clear later.
For now the student should ignore this peculiar terminology and merely
regard the parameter n as the number of terms in the sum $\sum Z_i^2$.

Theorem 8-7. If $X \sim \chi^2(n)$, then

(a) X has a gamma distribution with parameters r = n/2 and $\lambda = 1/2$,

(b) E(X) = n, Var(X) = 2n.

Proof:

(a) If n = 1, the distribution function of X for x > 0 is

$$F(x) = P(X \le x) = P(Z^2 \le x) = P(-\sqrt{x} < Z < \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x}) = 2\Phi(\sqrt{x}) - 1.$$

Therefore, the density of X is

$$f(x) = F'(x) = 2\varphi(\sqrt{x})/(2\sqrt{x}) = \frac{1}{\sqrt{2\pi x}}\, e^{-x/2} \quad \text{for } x > 0,$$

where $\varphi$ denotes the standard normal density function.
Comparing this with the density of a Gamma(1/2, 1/2) distribution (see Table 1,
Section VI) and recalling that $\Gamma(1/2) = \sqrt{\pi}$, we see that X ~ Gamma(1/2, 1/2).
It follows from Theorem 8-6(f) that $\sum Z_i^2 \sim \text{Gamma}(n/2, 1/2)$.

(b) This follows from the fact that the expectation and variance of a
Gamma(r, $\lambda$) distribution are $r/\lambda$ and $r/\lambda^2$. (See Exercise 1, page 64.)

The figure on the next page shows the graphs of the chi-square density
functions for 1, 2, 4, and 6 degrees of freedom. As the number of degrees
of freedom increases, the density function becomes more symmetric about its
mean. Since the chi-square distribution is the distribution of a sum of i.i.d.
random variables, it has a limiting normal distribution by the Central Limit
Theorem. The normal approximation to the chi-square distribution becomes quite
good for $n \ge 20$.

Table 1 on the following page gives the values of x for which the
distribution function F(x) of a chi-square distribution has certain
specified values. Suppose n = 20. Then the entry 31.4 in the 20th row under the
column headed .950 means that if $X \sim \chi^2(20)$, then P(X < 31.4) = 0.95. An
equivalent way of saying the same thing is to say that 31.4 is the 95th
percentile (or percentage point) of a chi-square distribution with 20 degrees of
freedom.

How well does the normal approximation work in this case? Since E(X) = 20
and Var(X) = 40, the normal approximation of P(X < 31.4) is given by

$$P(X < 31.4) \approx \Phi\left(\frac{31.4 - 20}{\sqrt{40}}\right) = \Phi(1.80) = 0.96.$$

This is within 0.01 of the actual probability 0.95 in this case.
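The comparison above can be reproduced without a table when the number of degrees of freedom is even. The sketch below relies on a standard closed form for the distribution function of a gamma distribution with integer parameter r = k, namely $1 - e^{-\lambda x} \sum_{j=0}^{k-1} (\lambda x)^j / j!$; this formula is not derived in these notes, and the function names are hypothetical.

```python
from math import erf, exp, sqrt


def phi(z: float) -> float:
    """Distribution function of a standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def chi2_cdf_even(x: float, n: int) -> float:
    """Exact P(X <= x) for X ~ chi-square(n) when n is even.

    Uses chi-square(n) = Gamma(n/2, 1/2) and the closed form of the gamma
    distribution function for an integer shape parameter (an assumption
    brought in from outside the notes).
    """
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    k = n // 2
    term, total = 1.0, 0.0  # term holds (x/2)^j / j!
    for j in range(k):
        total += term
        term *= (x / 2.0) / (j + 1)
    return 1.0 - exp(-x / 2.0) * total


def chi2_cdf_normal_approx(x: float, n: int) -> float:
    """Normal approximation based on E(X) = n and Var(X) = 2n."""
    return phi((x - n) / sqrt(2.0 * n))


exact = chi2_cdf_even(31.4, 20)
approx = chi2_cdf_normal_approx(31.4, 20)
print(round(exact, 2), round(approx, 2))
```

The two values round to 0.95 and 0.96, in line with the table entry and the approximation computed in the text.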

Theorem 8-8. If $X_1, X_2, \ldots, X_n$ are i.i.d., each $N(\mu, \sigma^2)$, then

(a) $\sum (X_i - \mu)^2/\sigma^2$ has a $\chi^2(n)$ distribution,

(b) $\sum (X_i - \bar X)^2/\sigma^2$ has a $\chi^2(n-1)$ distribution and is independent of $\bar X$.

Proof: (a) Set $Z_i = (X_i - \mu)/\sigma$. Then $Z_1, \ldots, Z_n$ are i.i.d., each
N(0,1). Therefore, $\sum Z_i^2 = \sum (X_i - \mu)^2/\sigma^2 \sim \chi^2(n)$.

(b) The proof of (b) will be omitted, but its plausibility is clear from
the following considerations. First, one can verify directly that

$$\sum (X_i - \mu)^2 = \sum (X_i - \bar X)^2 + n(\bar X - \mu)^2.$$


Table 1

PERCENTAGE POINTS OF A CHI-SQUARE DISTRIBUTION

Entries are the values of x for which F(x) equals the column heading; n is the number of degrees of freedom.

 n     .005     .010     .025     .050     .100     .250     .500     .750     .900     .950     .975     .990     .995
 1 .0000393  .000157  .000982   .00393    .0158     .102     .455     1.32     2.71     3.84     5.02     6.63     7.88
 2    .0100    .0201    .0506     .103     .211     .575     1.39     2.77     4.61     5.99     7.38     9.21     10.6
 3    .0717     .115     .216     .352     .584     1.21     2.37     4.11     6.25     7.81     9.35     11.3     12.8
 4     .207     .297     .484     .711     1.06     1.92     3.36     5.39     7.78     9.49     11.1     13.3     14.9
 5     .412     .554     .831     1.15     1.61     2.67     4.35     6.63     9.24     11.1     12.8     15.1     16.7
 6     .676     .872     1.24     1.64     2.20     3.45     5.35     7.84     10.6     12.6     14.4     16.8     18.5
 7     .989     1.24     1.69     2.17     2.83     4.25     6.35     9.04     12.0     14.1     16.0     18.5     20.3
 8     1.34     1.65     2.18     2.73     3.49     5.07     7.34     10.2     13.4     15.5     17.5     20.1     22.0
 9     1.73     2.09     2.70     3.33     4.17     5.90     8.34     11.4     14.7     16.9     19.0     21.7     23.6
10     2.16     2.56     3.25     3.94     4.87     6.74     9.34     12.5     16.0     18.3     20.5     23.2     25.2
11     2.60     3.05     3.82     4.57     5.58     7.58     10.3     13.7     17.3     19.7     21.9     24.7     26.8
12     3.07     3.57     4.40     5.23     6.30     8.44     11.3     14.8     18.5     21.0     23.3     26.2     28.3
13     3.57     4.11     5.01     5.89     7.04     9.30     12.3     16.0     19.8     22.4     24.7     27.7     29.8
14     4.07     4.66     5.63     6.57     7.79     10.2     13.3     17.1     21.1     23.7     26.1     29.1     31.3
15     4.60     5.23     6.26     7.26     8.55     11.0     14.3     18.2     22.3     25.0     27.5     30.6     32.8
16     5.14     5.81     6.91     7.96     9.31     11.9     15.3     19.4     23.5     26.3     28.8     32.0     34.3
17     5.70     6.41     7.56     8.67     10.1     12.8     16.3     20.5     24.8     27.6     30.2     33.4     35.7
18     6.26     7.01     8.23     9.39     10.9     13.7     17.3     21.6     26.0     28.9     31.5     34.8     37.2
19     6.84     7.63     8.91     10.1     11.7     14.6     18.3     22.7     27.2     30.1     32.9     36.2     38.6
20     7.43     8.26     9.59     10.9     12.4     15.5     19.3     23.8     28.4     31.4     34.2     37.6     40.0
21     8.03     8.90     10.3     11.6     13.2     16.3     20.3     24.9     29.6     32.7     35.5     38.9     41.4
22     8.64     9.54     11.0     12.3     14.0     17.2     21.3     26.0     30.8     33.9     36.8     40.3     42.8
23     9.26     10.2     11.7     13.1     14.8     18.1     22.3     27.1     32.0     35.2     38.1     41.6     44.2
24     9.89     10.9     12.4     13.8     15.7     19.0     23.3     28.2     33.2     36.4     39.4     43.0     45.6
25     10.5     11.5     13.1     14.6     16.5     19.9     24.3     29.3     34.4     37.7     40.6     44.3     46.9
26     11.2     12.2     13.8     15.4     17.3     20.8     25.3     30.4     35.6     38.9     41.9     45.6     48.3
27     11.8     12.9     14.6     16.2     18.1     21.7     26.3     31.5     36.7     40.1     43.2     47.0     49.6
28     12.5     13.6     15.3     16.9     18.9     22.7     27.3     32.6     37.9     41.3     44.5     48.3     51.0
29     13.1     14.3     16.0     17.7     19.8     23.6     28.3     33.7     39.1     42.6     45.7     49.6     52.3
30     13.8     15.0     16.8     18.5     20.6     24.5     29.3     34.8     40.3     43.8     47.0     50.9     53.7

Dividing both sides by $\sigma^2$ and rewriting the last term yields

$$\frac{\sum (X_i - \mu)^2}{\sigma^2} = \frac{\sum (X_i - \bar X)^2}{\sigma^2} + \left(\frac{\bar X - \mu}{\sigma/\sqrt{n}}\right)^2.$$

By part (a), the left member has a $\chi^2(n)$ distribution. The second term on
the right has a $\chi^2(1)$ distribution since it is the square of a standard
normal random variable. This suggests (but does not prove) that the first
term on the right has a $\chi^2(n-1)$ distribution. As for the independence of
$\sum (X_i - \bar X)^2$ and $\bar X$, this is plausible since $X_i - \bar X$ and $\bar X$ are independent
for each i. The reason is that $X_i - \bar X$ and $\bar X$ can be shown to have a
bivariate normal distribution. Hence, to check their independence it suffices
to show they are uncorrelated:

$$\operatorname{Cov}(X_i - \bar X, \bar X) = \operatorname{Cov}(X_i, \bar X) - \operatorname{Var}(\bar X) = \operatorname{Cov}(X_i, \textstyle\sum X_j/n) - \sigma^2/n$$
$$= (1/n)\operatorname{Cov}(X_i, X_i) - \sigma^2/n = \sigma^2/n - \sigma^2/n = 0.$$

For a rigorous proof of this theorem, see H. Cramér, Mathematical Methods of
Statistics, Princeton University Press, Princeton, N. J., 1946, Chapter 29.

Example. Suppose you have a random sample of size 30 from a $N(\mu, \sigma^2)$
distribution. If $\sigma^2 = 10$, what is the probability that the sample variance
$S^2 = \sum (X_i - \bar X)^2/30$ will exceed 15?

Solution: $P(S^2 > 15) = P\{\sum (X_i - \bar X)^2 > (15)(30)\} = P\{\sum (X_i - \bar X)^2/10 > 45\}$.

From Table 1, we see that 45 is between the 95th and 97.5th percentage points
(42.6 and 45.7) of a chi-square distribution with 29 degrees of freedom.
Using linear interpolation, $P(S^2 > 15) \approx 1 - 0.97 = 0.03$.
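The interpolated answer can be checked by simulation. The following Monte Carlo sketch draws normal samples directly, so it does not rely on Theorem 8-8. The mean of 0, the number of trials, and the seed are arbitrary choices made for illustration, and the function name is hypothetical.

```python
import random


def prob_sample_variance_exceeds(threshold: float = 15.0, n: int = 30,
                                 mu: float = 0.0, sigma2: float = 10.0,
                                 trials: int = 20_000, seed: int = 1) -> float:
    """Monte Carlo estimate of P(S^2 > threshold) for samples from N(mu, sigma2)."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    exceed = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / n  # biased sample variance S^2
        if s2 > threshold:
            exceed += 1
    return exceed / trials


estimate = prob_sample_variance_exceeds()
print(round(estimate, 3))
```

With this many trials the estimate should land close to the value 0.03 obtained from the table, with sampling error of roughly 0.001.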

Exercises.

1. (a) Given a random sample of size 30 from a $N(\mu, \sigma^2)$
distribution, find values c and d such that $P(c\sigma^2 < S^2 < d\sigma^2) = 0.95$.
(Ans. 0.53, 1.52.) Note that it follows from this that $P(S^2/d < \sigma^2 < S^2/c) = 0.95$.
That is, the unknown parameter value $\sigma^2$ lies between the random endpoints of
the interval $(S^2/d,\ S^2/c)$ with probability 0.95.

(b) Using the fact that $\sum (X_i - \bar X)^2/\sigma^2 \sim \chi^2(n-1)$, show that
$\operatorname{Var}(S^2) = 2(n-1)\sigma^4/n^2$.

2. Show that if U and V are independent with $U \sim \chi^2(m)$ and $V \sim \chi^2(n)$, then
$U + V \sim \chi^2(m+n)$. It follows that if $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_n$ are
independent random samples from two normal distributions that have the same
variance $\sigma^2$, then

$$\left[\sum (X_i - \bar X)^2 + \sum (Y_j - \bar Y)^2\right]/\sigma^2 \sim \chi^2(m+n-2).$$

SECTION IX - PARAMETER ESTIMATION

In many statistical applications, the experimental data consist of
observations $x_1, x_2, \ldots, x_n$ which, according to some mathematical model,
can be regarded as values of random variables $X_1, X_2, \ldots, X_n$ having a
joint distribution which depends on a vector of unknown parameters
$\theta = (\theta_1, \theta_2, \ldots, \theta_k)$. Quite often the purpose of the experiment is to use
the observations to estimate the values of one or more of the parameters
or perhaps some function of the parameters $g(\theta)$. At this stage, we
shall not question the appropriateness of the mathematical model,
but the student should be aware of the fact that the goodness of certain
estimators to be considered below depends critically on the assumption
that the joint distribution of the observations $X_1, X_2, \ldots, X_n$ is correctly
specified. Although there are many statistical techniques for testing
the appropriateness of statistical models, a comprehensive discussion of
model-building and methods for assessing appropriateness of models is
beyond the scope of this course.

To simplify notation below, let $X^{(n)}$ denote the vector of observations
$(X_1, X_2, \ldots, X_n)$, and let $x^{(n)} = (x_1, x_2, \ldots, x_n)$ denote
the value of $X^{(n)}$ for a particular experimental outcome. We recall that
an estimator $\delta = \delta(X^{(n)})$ is some function of the observations
used to estimate a parameter. It is implicit in this definition that $\delta$
is a random variable that depends only on the observations $X^{(n)}$ and the
values of known constants. This is meant to exclude those functions of the
observations that depend on the unknown parameters themselves. The value
$\delta(x^{(n)})$ of an estimator for a particular experimental outcome is
called an estimate of the parameter.
For example, suppose that $X_1, X_2, \ldots, X_n$ are i.i.d., each
$N(\mu,\sigma^2)$, where $\mu$ and $\sigma^2$ are both unknown. Here, the
vector of parameters specifying the distribution of $X^{(n)}$ is
$\theta = (\mu,\sigma^2)$, and the joint density of the observations is

$f(x^{(n)}; \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x_i-\mu)^2/2\sigma^2}$.

Examples of parameters of interest in this case are (a) $\mu$, (b) $\sigma^2$,
(c) $\sigma$, (d) $\mu + 1.28\sigma$, the 90th percentile of the distribution,
and (e) $P(X < c) = \Phi\left(\frac{c-\mu}{\sigma}\right)$, the proportion of
the population having x-values below c. Some estimators of $\mu$ are:

(1) $\bar X$, the sample mean,

(2) $X_1$, the first observation only,

(3) $[X_{(1)} + X_{(n)}]/2$, the average of the smallest and largest values
in the sample,

(4) $\mathrm{mdn}(X^{(n)})$, the sample median,

(5) $[X_{(k)} + X_{(n-k+1)}]/2$, for k some integer between 1 and n/2,

(6) $[X_{(2)} + X_{(3)} + \cdots + X_{(n-1)}]/(n-2)$, the average of the
observations that remain after "trimming" the smallest and largest
observations in the sample,

(7) $\delta_c$, the estimator which ignores the observations and estimates
$\mu$ to be equal to some preassigned constant c,

(8) $pc + (1-p)\bar X$ where p is some value between 0 and 1,

(9) $\max(\bar X, 0)$, the estimator which estimates $\mu$ using $\bar x$ if
$\bar x > 0$ but estimates $\mu$ to be equal to 0 if $\bar x \le 0$.

Clearly, in any particular instance, there are infinitely many estimators
that can be proposed, and these estimators can take wildly different values
for the same experimental outcome. To narrow down the class of estimators
that might be considered in a particular instance, one can impose various
criteria that seem reasonable under the circumstances, and then eliminate
those estimators that perform poorly according to the standards that are
adopted. The difficulty in providing a general theory of estimation is that
the goodness criteria can vary widely from application to application. It is
not hard to conceive of applications in which each of the nine estimators
listed above for $\mu$ would be best under certain circumstances. Thus, for
example, $X_1$ would be "better" than $\bar X$ if the value of $X_1$ is
available now, but it would cost \$1000 to get each additional observation,
and the increased precision is not worth the added cost. If the problem of
estimating $\mu$ is repeated several or even hundreds of times a day, then
perhaps computation time or the difficulty of doing calculations by hand
would be factors to be considered. In many applications one needs to worry
about the possibility of large recording errors or highly unusual
observations in the data, in which case one of the estimators (4)-(6) above
might be chosen. Also, there may be considerable evidence from previous
experiments (or from experiments taking place concurrently) that ought to be
considered in the estimation process.

The presentation that follows will be restricted primarily to considering
properties of estimators that are commonly used and are of relevance in a
wide range of applications. As we shall see below, the imposition of certain
goodness criteria leads to unique "best" estimators of many of the parameters
of the distributions introduced in the previous sections. Although the sense
in which these estimators are best is narrowly defined and does not include
such factors as ease of computation and cost of sampling, these estimators
have been widely adopted in practice, and many of these estimators satisfy
other goodness criteria that are not listed here.

Definition. Let $\delta = \delta(X^{(n)})$ be an estimator of a parameter
$g(\theta)$. The bias of $\delta$ is defined by

$B(\theta) = E_\theta(\delta) - g(\theta)$.

If $E_\theta(\delta) = g(\theta)$ for all values of $\theta$, $\delta$ is said
to be an unbiased estimator of $g(\theta)$.

The subscript $\theta$ on the expectation sign is included to remind the
reader that the distribution, and hence the expectation, of $\delta$ depends
on $\theta$. Note that unbiasedness requires that
$E_\theta(\delta) = g(\theta)$ for all possible values of $\theta$. In order
for this definition to be meaningful, the set of possible values of $\theta$
must be specified. In the absence of any explicit specification of the
parameter set, we shall assume that the set of possible values of $\theta$ is
the "usual" parameter set for that model. For example, in considering
estimates of $\mu$ and $\sigma$ in the $N(\mu,\sigma^2)$ case, the "usual"
parameter set is $\{(\mu,\sigma): -\infty < \mu < \infty,\ 0 < \sigma < \infty\}$.
However, in certain applications one may want to restrict the possible values
of $\mu$ to some subset of the line, e.g., to the nonnegative real numbers.
The usual parameter sets for many of the other distributions that will be
considered in this section are given in Table 1, Section VI.

Other things equal, we would ordinarily prefer unbiased estimators or, at
least, those for which the bias $B(\theta)$ is small for those values of
$\theta$ that are deemed most likely. As an indication that criteria other
than unbiasedness are of more importance in choosing estimators, consider
choosing between an unbiased estimator that has large variance and one that
is biased but has a distribution that is much more concentrated about the
parameter being estimated for all values of $\theta$. Clearly, what is needed
in choosing among estimators are measures of how close the values of the
estimators are to the parameters being estimated. Some simple measures that
have been proposed in the past are: (a) mean squared error
$E_\theta[\delta - g(\theta)]^2$, (b) mean absolute error
$E_\theta(|\delta - g(\theta)|)$, and (c) $P_\theta(|\delta - g(\theta)| > c)$,
the probability that $\delta$ misestimates the parameter $g(\theta)$ by more
than c units. Each of these measures of closeness is an instance of
$E_\theta L(\delta, g(\theta))$ where L is a "loss function" that specifies
the loss suffered if the estimated value is $\delta$ and the parameter value
is $g(\theta)$. Although this more general approach would seem to apply in
more situations, in actual practice loss functions can rarely be specified
precisely, and we shall not pursue this approach. For the purposes of this
presentation, we shall concentrate most of our attention on the first of the
three measures of closeness above. It is the easiest to work with, since the
mean squared error of an estimator bears a simple relationship to its bias
and its variance.

Theorem 9-1. If $\delta$ is an estimator of $g(\theta)$ with bias $B(\theta)$,
the mean squared error of $\delta$ satisfies

$E_\theta[\delta - g(\theta)]^2 = \mathrm{Var}_\theta(\delta) + [B(\theta)]^2$.

Proof: This theorem is merely a restatement of the easily verified fact that,
for any random variable Y having mean $\mu_Y$ and finite variance
$\sigma_Y^2$,

$E(Y - c)^2 = \sigma_Y^2 + (\mu_Y - c)^2$.

The verification is left as an exercise.
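Before attempting the algebra, the identity can be checked on a small discrete distribution. The sketch below uses exact rational arithmetic; the function name `check_identity` and the particular three-point distribution are illustrative assumptions, not taken from the notes.

```python
from fractions import Fraction

def check_identity(values, probs, c):
    """Return (E(Y-c)^2, Var(Y) + (E(Y)-c)^2) for a discrete random variable Y."""
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    lhs = sum((v - c) ** 2 * p for v, p in zip(values, probs))
    rhs = var + (mean - c) ** 2
    return lhs, rhs

# Hypothetical distribution: Y takes the values 0, 1, 4 with probabilities 1/2, 1/3, 1/6.
values = [0, 1, 4]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
lhs, rhs = check_identity(values, probs, c=2)
print(lhs, rhs)
```

A numerical check of this kind is not a proof, but it is a quick way to catch a sign error before writing out the verification.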
It follows from the theorem that, if $\delta$ is unbiased for $g(\theta)$,
then the mean squared error of $\delta$ is just the variance of $\delta$. In
many cases, there is a unique unbiased estimator of a parameter that has
minimum variance for every possible parameter value $\theta$.

Definition. An unbiased estimator $\delta^*$ is said to be the uniformly
minimum variance unbiased (UMVU) estimator of a parameter $g(\theta)$ if
$\delta^*$ has minimum variance among all unbiased estimators of $g(\theta)$
for all values of $\theta$. In this case the efficiency of any other
estimator $\delta$ relative to $\delta^*$ is defined to be the ratio

$\mathrm{Var}_\theta(\delta^*)/\mathrm{Var}_\theta(\delta)$.

For example, suppose $X_1, X_2, \ldots, X_n$ are i.i.d., each
$N(\mu,\sigma^2)$. Of the nine estimators of $\mu$ that were listed earlier,
the first six are all unbiased. Estimators (3)-(6) are all instances of
weighted averages $\sum w_k X_{(k)}$ of the order statistics
$X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ in which $\sum w_k = 1$ and
$w_k = w_{n-k+1}$ for $k = 1, 2, \ldots, n$. That is, $X_{(1)}$ and $X_{(n)}$
receive the same weight, $X_{(2)}$ and $X_{(n-1)}$ receive the same weight,
etc. If n = 20, the variances of the first six estimators of $\mu$ are

(1) $\mathrm{Var}(\bar X) = \sigma^2/20 = 0.05\sigma^2$.

(2) $\mathrm{Var}(X_1) = \sigma^2$.

(3) $\mathrm{Var}([X_{(1)} + X_{(20)}]/2) = 0.143\sigma^2$.

(4) $\mathrm{Var}(\mathrm{mdn}(X^{(n)})) = 0.073\sigma^2$.

(5) $\mathrm{Var}([X_{(6)} + X_{(15)}]/2) = 0.061\sigma^2$.

(6) $\mathrm{Var}([X_{(2)} + X_{(3)} + \cdots + X_{(19)}]/18) = 0.051\sigma^2$.

(See W. J. Dixon and Frank J. Massey, Jr., Introduction to Statistical
Analysis, Second Edition, McGraw-Hill, New York, p. 406.) Thus, among
these unbiased estimators, $\bar X$ has smallest variance.
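The ordering of these variances can be reproduced approximately by simulation. The following Python sketch is an illustration only: the function name, the seed, and the number of trials are arbitrary choices, and the results are Monte Carlo estimates rather than the tabulated values.

```python
import random
import statistics

def simulate_estimator_variances(n=20, k=6, trials=20000, seed=1974):
    """Estimate Var(estimator)/sigma^2 for several estimators of mu by simulation.

    Samples are standard normal (mu = 0, sigma = 1), so each returned variance
    is directly comparable with the coefficient of sigma^2 for that estimator.
    """
    rng = random.Random(seed)
    draws = {"mean": [], "midrange": [], "median": [],
             "quasi_midrange": [], "trimmed": []}
    for _ in range(trials):
        xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n))   # order statistics
        draws["mean"].append(sum(xs) / n)
        draws["midrange"].append((xs[0] + xs[-1]) / 2)             # estimator (3)
        draws["median"].append(statistics.median(xs))              # estimator (4)
        draws["quasi_midrange"].append((xs[k - 1] + xs[n - k]) / 2)  # estimator (5)
        draws["trimmed"].append(sum(xs[1:-1]) / (n - 2))           # estimator (6)

    def variance(vals):
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals) / len(vals)

    return {name: variance(vals) for name, vals in draws.items()}

variances = simulate_estimator_variances()
for name, v in variances.items():
    print(f"{name:15s} {v:.3f}")
```

With 20,000 trials the simulated variances should agree with the tabulated coefficients to roughly two decimal places; a different seed will change the third decimal.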
It will be shown later in this section that $\bar X$ is the UMVU estimator of
$\mu$ in this case. The efficiency of the median relative to $\bar X$ is
0.05/0.073 = 0.68. It can be shown that for large values of n the variance of
the sample median is approximately $(\pi/2)\sigma^2/n$, so that for large n
the efficiency of the median relative to $\bar X$ is approximately
$2/\pi = 0.64$. The implication of this is that $\bar X$ achieves
approximately the same precision as the sample median using only 64 percent
as many observations.

We note in passing that the "trimmed mean" estimator (6) above has efficiency
0.98. This estimator is almost as efficient as $\bar X$, and it affords some
protection against gross recording errors and "wild shots" in the data by
eliminating the largest and smallest observation from the calculation of the
estimate.

The other estimators (7)-(9) are biased estimators of $\mu$, but each of them
has smaller mean squared error than $\bar X$ for certain values of $\mu$. The
mean squared error of the estimator $\delta_c$, which estimates $\mu$ to be
equal to c for all values of the observations, is equal to
$E_\theta(\delta_c - \mu)^2 = (c - \mu)^2$, which is less than the mean
squared error of $\bar X$, namely $\sigma^2/n$, for values of $\mu$ close to
c.

The estimator $\delta = pc + (1-p)\bar X$ has bias

$E(\delta) - \mu = pc + (1-p)\mu - \mu = p(c-\mu)$,

and its variance is

$\mathrm{Var}(\delta) = (1-p)^2\,\mathrm{Var}(\bar X) = (1-p)^2\sigma^2/n$.

Therefore, by Theorem 9-1, the mean squared error of $\delta$ is

$E(\delta - \mu)^2 = (1-p)^2\sigma^2/n + p^2(c-\mu)^2$.

Note that this estimator has smaller variance than $\bar X$, so that if the
bias of $\delta$ is not too large (i.e., if c is close to $\mu$), then
$\delta$ has smaller mean squared error than $\bar X$.
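How close c must be to $\mu$ can be made concrete with a few lines of Python. This is a sketch that simply evaluates the mean squared error formula for estimator (8); the function names and the numerical inputs (p = 0.5, c = 0, $\sigma^2$ = 1, n = 20) are hypothetical choices for illustration.

```python
import math

def mse_shrinkage(p, c, mu, sigma2, n):
    """MSE of p*c + (1-p)*Xbar: variance plus squared bias (Theorem 9-1)."""
    variance = (1 - p) ** 2 * sigma2 / n
    bias = p * (c - mu)
    return variance + bias ** 2

def mse_sample_mean(sigma2, n):
    """MSE of Xbar, which is unbiased, so the MSE equals its variance."""
    return sigma2 / n

def break_even_distance(p, sigma2, n):
    """Largest |c - mu| for which the shrinkage estimator still beats Xbar.

    Obtained by solving (1-p)^2 sigma2/n + p^2 d^2 < sigma2/n for d.
    """
    return math.sqrt(sigma2 * (2 - p) / (n * p))

p, c, sigma2, n = 0.5, 0.0, 1.0, 20
for mu in (0.1, 0.3, 1.0):
    print(mu, mse_shrinkage(p, c, mu, sigma2, n), mse_sample_mean(sigma2, n))
print(break_even_distance(p, sigma2, n))
```

In this illustration the biased estimator wins when $\mu$ is within about 0.39 of c and loses badly once $\mu$ is far from c, which is the trade-off described above.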

The biased estimator $\delta = \max(\bar X, 0)$ has smaller mean squared error
than $\bar X$ for all positive values of $\mu$ since

$(\delta - \mu)^2 \le (\bar X - \mu)^2$

for all possible sample values with strict inequality holding whenever
$\bar X < 0$. It follows that
$E_\theta(\delta - \mu)^2 < E_\theta(\bar X - \mu)^2$ for all positive values
of $\mu$.

Exercises. 1. Show that, for any random variable Y having mean $\mu_Y$ and
variance $\sigma_Y^2 < \infty$, $E(Y - c)^2 = \sigma_Y^2 + (\mu_Y - c)^2$.

2. Let $X_1, X_2, \ldots, X_{25}$ be i.i.d., each Bernoulli(p). Compute the
bias, variance, and mean squared error of each of the following estimators
of p: (a) $\bar X$, (b) $\delta_{1/2}$, the estimator having value 1/2 for
all values of the $X_i$'s, (c) the "constant risk" estimator
$\frac{1}{12} + \frac{5}{6}\bar X$. Plot the mean squared error as a function
of p for each of the three estimators.
Ans. (a) 0, p(1-p)/25, p(1-p)/25; (b) (1-2p)/2, 0, $(1-2p)^2/4$;
(c) (1-2p)/12, p(1-p)/36, 1/144.

3. Let $X_1, X_2, \ldots, X_n$ be i.i.d., $N(\mu,\sigma^2)$. Consider the
estimators $S^2 = SS/n$ and $\hat\sigma^2 = SS/(n-1)$ where
$SS = \sum(X_i - \bar X)^2$. (a) Show that, as estimators of $\sigma^2$,
$S^2$ has smaller mean squared error than the unbiased estimator
$\hat\sigma^2$. (b) Among estimators of the form $\tilde\sigma^2 = c\,SS$,
determine the value of c for which $\tilde\sigma^2$ has smallest mean squared
error. Ans. c = 1/(n+1).
Definition. Suppose $X_1, X_2, \ldots, X_n$ have joint density (or
probability function) $f(x^{(n)}; \theta)$ where $\theta$ is a vector of
unknown parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$. If the
observed value of $X^{(n)} = (X_1, X_2, \ldots, X_n)$ is $x^{(n)}$, the
function

$L(\theta) = f(x^{(n)}; \theta)$

considered as a function of $\theta$ is called the likelihood function. If,
for each possible value of $X^{(n)}$, there is a unique value of $\theta$,
say $\hat\theta(X^{(n)})$, that maximizes the likelihood function, then the
estimator $\hat\theta = \hat\theta(X^{(n)})$ determined in this way is called
the maximum likelihood estimator (MLE) of $\theta$.

In discrete cases, using the maximum likelihood estimator amounts to
choosing $\theta$ to maximize the probability of what was observed. As we
shall see later, maximum likelihood estimators are usually very good
estimators for the parameters in the models that we have discussed so far.

For example, suppose that $X_1, X_2, \ldots, X_n$ are i.i.d.,
Bernoulli($\theta$). Then

$L(\theta) = f(x^{(n)}; \theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{t}(1-\theta)^{n-t}$

where $t = \sum x_i$. We distinguish three cases:

(a) t = 0. In this case, the likelihood function is
$L(\theta) = (1-\theta)^n$, a strictly decreasing function of $\theta$ on the
unit interval [0,1] that achieves its maximum at $\theta = 0$.

(b) t = n. Here, the likelihood function is $L(\theta) = \theta^n$, which
achieves its maximum at $\theta = 1$.

(c) 0 < t < n. In this case, the likelihood function is
$L(\theta) = \theta^t(1-\theta)^{n-t}$, which is a polynomial in $\theta$
that has value 0 at the end points of the unit interval and is positive for
$0 < \theta < 1$. The MLE $\hat\theta$ can be determined by setting the
derivative of $L(\theta)$ equal to 0 and solving for $\theta$. However, it is
easier to determine the maximum of the logarithm of $L(\theta)$:

$\log L(\theta) = t \log\theta + (n-t)\log(1-\theta)$.

Since log x is an increasing function of x, the value of $\theta$ that
maximizes $\log L(\theta)$ will also maximize $L(\theta)$. Setting the
derivative of $\log L(\theta)$ equal to 0 yields

$\frac{t}{\theta} - \frac{n-t}{1-\theta} = 0$.

Solving for $\theta$ gives $\theta = t/n = \sum x_i/n = \bar x$. In all three
cases, therefore, the MLE of $\theta$ is $\hat\theta = \bar X$. This
estimator is unbiased and has variance $\theta(1-\theta)/n$. It will be shown
later that $\bar X$ is the UMVU estimator of $\theta$.
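The conclusion that the Bernoulli log likelihood peaks at t/n can be checked by brute force. The Python sketch below evaluates the log likelihood on a grid; the function names and the sample values n = 20, t = 7 are hypothetical, chosen only for illustration.

```python
import math

def bernoulli_log_likelihood(theta, t, n):
    """log L(theta) = t log(theta) + (n - t) log(1 - theta), for 0 < theta < 1."""
    return t * math.log(theta) + (n - t) * math.log(1.0 - theta)

def grid_argmax(t, n, steps=1000):
    """Return the grid point in (0, 1) at which the log likelihood is largest."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda theta: bernoulli_log_likelihood(theta, t, n))

n, t = 20, 7
theta_hat = grid_argmax(t, n)
print(theta_hat, t / n)
```

A grid search is used here rather than calculus only to provide an independent check on the derivation; it covers case (c), 0 < t < n, since the logarithm is undefined at the end points.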

When the joint density or probability function $f(x^{(n)}; \theta)$ has
several unknown parameters, one can usually find the MLE's of the parameters
by setting the partial derivatives of $\log L(\theta)$ with respect to the
parameters $\theta_i$ equal to zero and solving the resulting equations. For
example, if $X_1, X_2, \ldots, X_n$ are i.i.d., each $N(\mu,\sigma^2)$, then

$L(\mu,\sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x_i-\mu)^2/2\sigma^2} = (\sqrt{2\pi}\,\sigma)^{-n} \exp[-\sum(x_i-\mu)^2/2\sigma^2]$.

Since $\sum(x_i-\mu)^2 = \sum(x_i-\bar x)^2 + n(\bar x-\mu)^2$,

$\log L(\mu,\sigma) = -n\log\sqrt{2\pi} - n\log\sigma - \frac{\sum(x_i-\bar x)^2}{2\sigma^2} - \frac{n(\bar x-\mu)^2}{2\sigma^2}$.

Here we could set the partial derivatives of $\log L(\mu,\sigma)$ with respect
to $\mu$ and $\sigma$ equal to zero and solve the resulting two equations for
$\mu$ and $\sigma$. However, we observe that $\mu$ only occurs in the last
term on the right, and this term is maximized by setting $\mu = \bar x$. Thus
the problem reduces to choosing $\sigma$ to maximize the sum of the other
terms. Setting $\partial \log L(\mu,\sigma)/\partial\sigma = 0$ and solving
for $\sigma$ yields

$\sigma^2 = \sum(x_i - \bar x)^2/n$.
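The differentiation step omitted here is short. The display below is a sketch of that calculus, starting from the log likelihood with $\mu = \bar x$ substituted so that the last term vanishes.

```latex
\begin{align*}
\frac{\partial}{\partial\sigma}\log L(\bar x,\sigma)
  &= \frac{\partial}{\partial\sigma}
     \left[-n\log\sqrt{2\pi} - n\log\sigma
           - \frac{\sum (x_i-\bar x)^2}{2\sigma^2}\right] \\
  &= -\frac{n}{\sigma} + \frac{\sum (x_i-\bar x)^2}{\sigma^3} = 0 \\
\Longrightarrow\quad \sigma^2 &= \frac{\sum (x_i-\bar x)^2}{n}.
\end{align*}
```

The derivative is positive for $\sigma^2$ below this value and negative above it, so the stationary point is a maximum.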

Thus, the MLE's of $\mu$ and $\sigma$ are $\bar X$ and
$S = [\sum(X_i - \bar X)^2/n]^{1/2}$. The MLE's of functions of $\mu$ and
$\sigma$ are the corresponding functions of $\bar X$ and S. For example, the
MLE of $\sigma^2$ is $S^2$, and the MLE of $\mu + 1.28\sigma$ is
$\bar X + 1.28S$. As we saw earlier, $S^2$ is a biased estimator of
$\sigma^2$, but it has smaller mean squared error than the usual unbiased
estimator $\hat\sigma^2 = \sum(X_i - \bar X)^2/(n-1)$. The MLE of $\sigma$ is
also a biased estimator. An unbiased estimator of $\sigma$ can be obtained by
taking $\tilde\sigma = c_n\hat\sigma$ where $\hat\sigma$ is the square root
of $\hat\sigma^2$ and

$c_n = [(n-1)/2]^{1/2}\,\Gamma[(n-1)/2]/\Gamma(n/2)$.

The values of $c_n$ for $n \le 10$ are

n      2      3      4      5      6      7      8      9      10
c_n    1.253  1.128  1.085  1.064  1.051  1.042  1.036  1.032  1.028

For n > 5, $c_n$ is well approximated by $1 + 1/[4(n-1)]$. Another way of
representing this unbiased estimator is in the form
$\tilde\sigma = [\sum(X_i - \bar X)^2/k_n]^{1/2}$, where
$k_n = 2\{\Gamma(n/2)/\Gamma[(n-1)/2]\}^2$. The values of $k_n$ for
$n \le 10$ are:

n      2      3      4      5      6      7      8      9      10
k_n    0.637  1.571  2.546  3.534  4.527  5.522  6.519  7.517  8.515

For n > 10, $k_n$ is approximately equal to n - 3/2. (See John Gurland
and Ram C. Tripathi, "A Simple Approximation for Unbiased Estimation of
the Standard Deviation," The American Statistician, October 1971, pp. 30-32.)
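Both tables can be regenerated from the gamma-function formulas. The following is a minimal Python sketch using the standard library's `math.gamma`; the function names `c_n` and `k_n` are chosen here to mirror the notation of the text.

```python
import math

def c_n(n):
    """Factor c_n such that c_n * sigma_hat is unbiased for sigma (normal sample)."""
    return math.sqrt((n - 1) / 2) * math.gamma((n - 1) / 2) / math.gamma(n / 2)

def k_n(n):
    """Divisor k_n such that sqrt(SS / k_n) is unbiased for sigma."""
    return 2 * (math.gamma(n / 2) / math.gamma((n - 1) / 2)) ** 2

for n in range(2, 11):
    # Columns: n, c_n, the approximation 1 + 1/[4(n-1)], k_n, the approximation n - 3/2.
    print(n, round(c_n(n), 3), round(1 + 1 / (4 * (n - 1)), 3),
          round(k_n(n), 3), n - 1.5)
```

The two forms describe the same estimator, since $c_n^2/(n-1) = 1/k_n$; printing the approximations alongside the exact values shows how quickly they become accurate as n grows.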

Example. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d., each uniformly
distributed on $(0,\theta)$, so that the density of each of the $X_i$'s can
be specified as $f(x_i;\theta) = 1/\theta$ for $0 < x_i \le \theta$. Here,
the likelihood function is

$L(\theta) = 0$ if $x_i < 0$ or $x_i > \theta$ for some i,
$L(\theta) = 1/\theta^n$ if $0 < x_i \le \theta$ for $i = 1, 2, \ldots, n$.

Since $L(\theta)$ is a decreasing function of $\theta$ for
$\theta \ge x_{(n)}$ and $L(\theta)$ is zero for $\theta < x_{(n)}$, it
follows that $L(\theta)$ is maximized by $\theta = x_{(n)}$, and the MLE of
$\theta$ is $\hat\theta = X_{(n)}$. By Exercise 1 below, $\hat\theta$ is a
biased estimator of $\theta$ with expectation
$E(\hat\theta) = n\theta/(n+1)$ and
$\mathrm{Var}(\hat\theta) = n\theta^2/[(n+1)^2(n+2)]$.

Next consider estimating $\mu = \theta/2$, the mean of the $X_i$'s. The MLE
of $\mu$ is $\hat\mu = \hat\theta/2$, which is again a biased estimator. The
corresponding unbiased estimator of $\mu$ that depends on $\hat\theta$ is
$\tilde\mu = (n+1)\hat\theta/2n$, which has variance

$\mathrm{Var}(\tilde\mu) = (n+1)^2\,\mathrm{Var}(\hat\theta)/4n^2 = \theta^2/[4n(n+2)]$.

How does this compare with the sample mean $\bar X$? Since each $X_i$ has
variance $\theta^2/12$, $\mathrm{Var}(\bar X) = \theta^2/12n$. Hence, the
efficiency of $\bar X$ relative to $\tilde\mu$ is
$\mathrm{Var}(\tilde\mu)/\mathrm{Var}(\bar X) = 3/(n+2)$. Note how poorly
$\bar X$ performs relative to $\tilde\mu$ in this case. For example, if
n = 28, the variance of $\tilde\mu$ is only one tenth as large as the
variance of $\bar X$.
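The efficiency calculation can be reproduced with exact fractions. This Python sketch takes the variance formula for the sample maximum as given (it is the result of Exercise 1) and sets $\theta = 1$ by default; the function names are illustrative.

```python
from fractions import Fraction

def var_unbiased_max_estimator(n, theta=1):
    """Variance of (n+1) X_(n) / (2n), using Var(X_(n)) = n theta^2 / ((n+1)^2 (n+2))."""
    var_max = Fraction(n) * theta ** 2 / ((n + 1) ** 2 * (n + 2))
    return Fraction((n + 1) ** 2, 4 * n ** 2) * var_max

def var_sample_mean(n, theta=1):
    """Variance of the sample mean of n Uniform(0, theta) observations."""
    return Fraction(theta ** 2, 12 * n)

def efficiency_of_mean(n):
    """Efficiency of the sample mean relative to the estimator based on X_(n)."""
    return var_unbiased_max_estimator(n) / var_sample_mean(n)

for n in (1, 4, 10, 28, 100):
    print(n, efficiency_of_mean(n))
```

The ratio equals 3/(n+2) for every n, so the sample mean is fully efficient only when n = 1 and loses ground steadily as the sample grows.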

Exercises. 1. Let $Y = X_{(n)}$ where $X_1, X_2, \ldots, X_n$ are i.i.d.,
each Uniform$(0,\theta)$. Show that (a) the density of Y is
$f(y) = ny^{n-1}/\theta^n$ for $0 < y < \theta$, (b)
$E(Y) = n\theta/(n+1)$, and (c)
$\mathrm{Var}(Y) = n\theta^2/[(n+1)^2(n+2)]$.

2. Show that, if $X_1, X_2, \ldots, X_n$ are i.i.d., each having a negative
exponential distribution with parameter $\lambda$, then the MLE of $\lambda$
is $\hat\lambda = 1/\bar X$.

3. Assume that $Y_1, Y_2, \ldots, Y_n$ are independent random variables, and
$Y_i \sim N(\alpha x_i, \sigma^2)$ where $\sigma, x_1, x_2, \ldots, x_n$ are
known constants. Show that (a) the MLE of $\alpha$ is
$\hat\alpha = \sum x_iY_i/\sum x_i^2$, (b) $\hat\alpha$ is an unbiased
estimator of $\alpha$ with variance $\sigma^2/\sum x_i^2$.

4. Show that, if $X_1, X_2, \ldots, X_n$ are i.i.d., each
Poisson($\lambda$), then the MLE of $\lambda$ is $\hat\lambda = \bar X$.
The reader should note that no optimality properties for maximum likelihood
estimators were stated in the previous section. There is a good reason for
this omission, namely, the fact that some MLE's are poor estimators.
Sometimes it is asserted that MLE's are good estimators because they have
desirable "asymptotic" properties. To see that the reasoning behind this
assertion is shaky, let us first define our terms.

Definition. Suppose the vector of observations $X^{(n)}$ has joint density
or probability function $f(x^{(n)}; \theta)$, and let
$\hat\gamma_n = \hat\gamma_n(X^{(n)})$ be an estimator (or, more precisely, a
sequence of estimators) of a parameter $\gamma = g(\theta)$. The sequence
$\hat\gamma_n$ is said to be

(a) consistent if $\hat\gamma_n$ tends to $\gamma$ in probability [i.e., for
any $\epsilon > 0$, $P_\theta(|\hat\gamma_n - \gamma| \ge \epsilon)$ tends to
0 as n becomes infinite];

(b) asymptotically normal with mean $\gamma$ and variance $\sigma^2/n$ if the
distribution of $\sqrt{n}(\hat\gamma_n - \gamma)/\sigma$ tends to a standard
normal distribution;

(c) best asymptotically normal (BAN) if $\hat\gamma_n$ is asymptotically
normal with mean $\gamma$ and variance $\sigma^2/n$ and, if $\tilde\gamma_n$
is any other asymptotically normal sequence with mean $\gamma$ and variance
$\tau^2/n$, then $\sigma^2 \le \tau^2$.

Since $P_\theta(|\hat\gamma_n - \gamma| \ge \epsilon) \le E_\theta(\hat\gamma_n - \gamma)^2/\epsilon^2$
by Theorem 5-6, and

$E_\theta(\hat\gamma_n - \gamma)^2 = \mathrm{Var}_\theta(\hat\gamma_n) + [B_n(\theta)]^2$,

where $B_n(\theta)$ is the bias of $\hat\gamma_n$, to prove consistency it
suffices to show that $\mathrm{Var}(\hat\gamma_n) \to 0$ and
$E(\hat\gamma_n) \to \gamma$ for all values of $\theta$. Thus for example, if
$X_1, X_2, \ldots, X_n$ are i.i.d., each $N(\mu,\sigma^2)$, then $\bar X$ is
a consistent, asymptotically normal estimator of $\mu$, but so are the
following ridiculous estimators:

(a) the average of $X_1$ and every thousandth observation thereafter,

(b) $\hat\mu_n = 17$ if $n < 10^{10}$, and $\hat\mu_n = \bar X$ if
$n \ge 10^{10}$,

(c) $(n\bar X + 10^{10})/(n+1)$.

The point of these examples is that consistency says nothing about the
goodness of an estimator for small samples or even for very large ones.
Conversely, inconsistent estimators may still be good in small samples.
As a frivolous example in the normal case above, consider

$\hat\mu_n = \bar X$ if $n < 10^{10}$, and $\hat\mu_n = 0$ if
$n \ge 10^{10}$.
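To see how uninformative consistency can be, one can tabulate the mean squared error of the third ridiculous estimator, $(n\bar X + 10^{10})/(n+1)$, for several sample sizes. The Python sketch below applies the variance-plus-squared-bias decomposition; the function name and the choices $\mu = 0$, $\sigma^2 = 1$ are illustrative assumptions.

```python
def mse_of_offset_estimator(n, mu, sigma2, offset=10 ** 10):
    """MSE of (n*Xbar + offset)/(n+1), where Xbar has mean mu and variance sigma2/n."""
    bias = (offset - mu) / (n + 1)
    variance = (n / (n + 1)) ** 2 * sigma2 / n
    return variance + bias ** 2

for n in (10 ** 2, 10 ** 6, 10 ** 10, 10 ** 14):
    print(n, mse_of_offset_estimator(n, mu=0.0, sigma2=1.0), 1.0 / n)
```

The mean squared error does tend to zero, so the estimator is consistent, but it remains many orders of magnitude larger than $\sigma^2/n$ for every sample size that could arise in practice.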


The reason for citing asymptotic properties of estimators is that in many
cases it is difficult to determine the properties of estimators in small
samples, but methods exist for determining their asymptotic distributions.
A second reason is based on the wishful thinking that those estimators that
have desirable asymptotic properties will also prove to be good in small
samples.

For what it is worth, if $X_1, X_2, \ldots, X_n$ are i.i.d., each having
density or probability function $f(x_i; \theta)$, and if $\hat\theta$ is the
MLE of $\theta$, then $\hat\theta$ is a consistent, BAN estimator of
$\theta$ provided that $f(x_i; \theta)$ satisfies certain regularity
conditions.[1] On the other hand, examples exist to show that MLE's need not
be consistent.[2]

[1] For a comprehensive discussion of maximum likelihood estimation, see
M. G. Kendall and Alan Stuart, The Advanced Theory of Statistics, Vol. 2,
Hafner Publishing Company, New York, 1961, Chapter 18.

[2] See Kendall and Stuart, Ibid., p. 61. Also, R. R. Bahadur, "Examples
of Inconsistency of Maximum Likelihood Estimates," Sankhya, December 1958,
pp. 207-210.

Since the method of maximum likelihood sometimes leads to poor estimators,
the reader may wonder why we have devoted so much space to this topic. The
reason is that there is no single method for deriving good estimators, and
the maximum likelihood estimators provide a reasonable starting point.
Another reason is that, if
$T = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$, where
$\hat\theta_i$ is the MLE of $\theta_i$, and if $g(\theta)$ is a parameter
for which a UMVU estimator $\delta^*$ exists, then $\delta^*$ is usually
either $g(T)$ or some multiple of $g(T)$. Moreover, T is usually a
"sufficient" statistic for $\theta$.

Definition. Let $X^{(n)}$ have joint density or probability function
$f(x^{(n)}; \theta)$. A statistic T is said to be sufficient for
$\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ if the conditional
distribution of $X^{(n)}$, given T = t, does not depend on $\theta$.

The importance of a sufficient statistic, which may be a single random
variable or a vector of random variables $T = (T_1, T_2, \ldots, T_m)$, is
that it summarizes all the information about $\theta$ that is contained in
the sample values. Since the conditional distribution of $X^{(n)}$, given
T = t, does not depend on $\theta$, it follows that the conditional
distribution of any other statistic $U = u(X^{(n)})$, given T = t, does not
depend on $\theta$ either. Since the conditional distribution of U is the
same for all $\theta$, knowing the value of U cannot provide any additional
information about the value of $\theta$.

Example. Let $X_1, X_2, \ldots, X_n$ be i.i.d., Bernoulli($\theta$). To see
that $T = \sum X_i$ is sufficient for $\theta$, consider

$P_\theta(X^{(n)} = x^{(n)} \mid T = t) = P_\theta(X^{(n)} = x^{(n)},\ T = t)/P_\theta(T = t)$.

The numerator on the right is zero unless $t = \sum x_i$, in which case

$P_\theta(X^{(n)} = x^{(n)},\ T = t) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i} = \theta^{t}(1-\theta)^{n-t}$.

Since $P_\theta(T = t) = \binom{n}{t}\theta^{t}(1-\theta)^{n-t}$,

$P_\theta(X^{(n)} = x^{(n)} \mid T = t) = 0$ if $t \ne \sum x_i$,
$P_\theta(X^{(n)} = x^{(n)} \mid T = t) = 1/\binom{n}{t}$ if $t = \sum x_i$.

The expression on the right is free of $\theta$, completing the proof that T
is sufficient for $\theta$.
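The claim that the conditional distribution is free of the parameter can be verified by direct enumeration for a small sample. The Python sketch below uses exact fractions; the function name and the choices n = 4, t = 2 and the two parameter values are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def conditional_given_total(n, t, theta):
    """Return {x: P(X = x | T = t)} for i.i.d. Bernoulli(theta), T = sum of the X_i."""
    joint = {}
    for x in product((0, 1), repeat=n):
        if sum(x) == t:
            # P(X = x, T = t) = theta^t (1 - theta)^(n - t) when sum(x) = t.
            joint[x] = theta ** t * (1 - theta) ** (n - t)
    total = sum(joint.values())          # this is P(T = t)
    return {x: p / total for x, p in joint.items()}

n, t = 4, 2
for theta in (Fraction(1, 4), Fraction(2, 3)):
    cond = conditional_given_total(n, t, theta)
    print(theta, set(cond.values()), Fraction(1, comb(n, t)))
```

Whatever parameter value is supplied, every sample with the given total receives the same conditional probability, the reciprocal of the binomial coefficient.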

In general, it is hard to establish sufficiency directly from the definition
as was done in this case. Fortunately, the following theorem enables us to
spot sufficient statistics easily from the joint density or probability
function of $X^{(n)}$.

Theorem 9-2. (Fisher-Neyman Factorization Theorem.) A statistic
$T = t(X^{(n)})$ is a sufficient statistic for $\theta$ if and only if the
joint density or probability function of $X^{(n)}$ can be factored into two
parts

$f(x^{(n)}; \theta) = g(t,\theta)\, h(x^{(n)})$,

where $g(t,\theta)$ depends only on $t = t(x^{(n)})$ and the parameter(s)
$\theta$, and $h(x^{(n)})$ does not depend on $\theta$.

Example. In the Bernoulli case above,

$f(x^{(n)}; \theta) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}$.

Here, we can apply the Factorization Theorem by setting $h(x^{(n)}) = 1$ and
$g(t,\theta) = \theta^{t}(1-\theta)^{n-t}$ where $t = \sum x_i$. It follows
that $T = \sum X_i$ is a sufficient statistic for $\theta$.

Example. Let $X_1, X_2, \ldots, X_n$ be i.i.d., each $N(\mu,\sigma^2)$ where
both $\mu$ and $\sigma^2$ are unknown. Then

$f(x^{(n)}; \mu,\sigma) = (\sqrt{2\pi}\,\sigma)^{-n}\, e^{-\sum(x_i-\mu)^2/2\sigma^2}$.

Since $\sum(x_i-\mu)^2 = \sum(x_i-\bar x)^2 + n(\bar x-\mu)^2$, it follows
from the Factorization Theorem by setting $h(x^{(n)}) = 1$ that
$T = (\bar X, \sum(X_i-\bar X)^2)$ is a (set of) sufficient statistic(s). If
$\sigma^2$ is known, then using the factorization

$f(x^{(n)}; \mu) = e^{-n(\bar x-\mu)^2/2\sigma^2} \cdot (\sqrt{2\pi}\,\sigma)^{-n}\, e^{-\sum(x_i-\bar x)^2/2\sigma^2}$,

we see that $\bar X$ is sufficient for $\mu$. If $\mu$ is known, then it
follows from the first representation above that $\sum(X_i-\mu)^2$ is
sufficient for $\sigma^2$.
Note that it follows from the Factorization Theorem that, if T is sufficient
for $\theta$, and U = u(T) is some one-to-one function of T, then U is also
sufficient for $\theta$. For example, in the Bernoulli case above, knowing
that $T = \sum X_i$ is sufficient for $\theta$ implies that $\bar X$ is also
sufficient for $\theta$. In the normal case, $(\bar X, S^2)$,
$(\bar X, \hat\sigma^2)$, and $(\sum X_i, \sum X_i^2)$ are all sufficient
statistics for $\mu$ and $\sigma$.

Theorem 9-3. (Rao-Blackwell Theorem.) Let T be a sufficient statistic for
$\theta$, and let $\delta$ be any unbiased estimator of $g(\theta)$. Then
$\delta^* = E(\delta \mid T)$ is also an unbiased estimator of $g(\theta)$
and $\mathrm{Var}_\theta(\delta^*) \le \mathrm{Var}_\theta(\delta)$ with
strict inequality holding unless $\delta$ is a function of T. If $\delta$ is
a biased estimator of $g(\theta)$, then $\delta^*$ has the same bias as
$\delta$ for all values of $\theta$, and
$\mathrm{Var}(\delta^*) \le \mathrm{Var}(\delta)$, implying that the mean
squared error of $\delta^*$ is at least as small as that for $\delta$ for all
$\theta$.

Proof: $\delta^* = E(\delta \mid T)$ is a function of T alone (and does not
depend on $\theta$) since, given T = t, the conditional distribution of
$\delta = \delta(X^{(n)})$ is independent of $\theta$. Hence, the conditional
expectation of $\delta$, given T, is independent of $\theta$. The estimator
$\delta^*$ has the same bias as $\delta$, because $\delta^*$ and $\delta$
have the same expectation for all values of $\theta$ by Theorem 7-6(a):

$E_\theta(\delta^*) = E_\theta(E(\delta \mid T)) = E_\theta(\delta)$.

The fact that $\mathrm{Var}(\delta^*) \le \mathrm{Var}(\delta)$ follows from
Exercise 4(b), page 92:

$\mathrm{Var}_\theta(\delta) = E_\theta[\mathrm{Var}(\delta \mid T)] + \mathrm{Var}_\theta[E(\delta \mid T)]$,

and the fact that $\mathrm{Var}(\delta \mid T) \ge 0$. Note that
$\mathrm{Var}_\theta(\delta^*) < \mathrm{Var}_\theta(\delta)$ unless
$\mathrm{Var}(\delta \mid T) = 0$, which would imply that $\delta$ is a
function of T.
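A small Bernoulli example illustrates the theorem. The Python sketch below starts from the crude unbiased estimator $X_1$, conditions on $T = \sum X_i$ by enumerating all samples, and compares exact variances; the function name and the values n = 4, $\theta$ = 1/3 are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

def rao_blackwell_demo(n, theta):
    """Compare delta = X_1 with delta* = E(X_1 | T) for i.i.d. Bernoulli(theta)."""
    # Joint probability of every possible 0/1 sample of size n.
    probs = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}

    # E(X_1 | T = t), computed from the definition of conditional expectation.
    cond_mean = {}
    for t in range(n + 1):
        p_t = sum(p for x, p in probs.items() if sum(x) == t)
        cond_mean[t] = sum(x[0] * p for x, p in probs.items() if sum(x) == t) / p_t

    def moments(estimator):
        mean = sum(estimator(x) * p for x, p in probs.items())
        var = sum((estimator(x) - mean) ** 2 * p for x, p in probs.items())
        return mean, var

    original = moments(lambda x: x[0])
    improved = moments(lambda x: cond_mean[sum(x)])
    return original, improved, cond_mean

original, improved, cond_mean = rao_blackwell_demo(4, Fraction(1, 3))
print(original, improved, cond_mean)
```

In this case the conditional expectation works out to T/n, the sample mean, so conditioning leaves the expectation unchanged while reducing the variance by a factor of n.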

An implication of the theorem is that any estimator that is not a function
of a sufficient statistic can always be improved upon by an estimator that
is a function of a sufficient statistic. For example, suppose
$X_1, X_2, \ldots, X_n$ are i.i.d., each having a uniform distribution on
$(0,\theta)$. By writing the joint density of the observations in the form

$f(x^{(n)}; \theta) = (1/\theta^n)\, s(x_{(n)}, \theta)$,

where $s(x_{(n)}, \theta)$ is 1 if $x_{(n)} \le \theta$ and is 0 otherwise,
we see from the Factorization Theorem that $T = X_{(n)}$ is a sufficient
statistic for $\theta$. Consider estimating $\mu = \theta/2$, the mean of the
$X_i$'s. Two unbiased estimators of $\mu$ are $X_1$ and $\bar X$, neither of
which is a function of the sufficient statistic. It follows that
$E(X_1 \mid T)$ and $E(\bar X \mid T)$ are unbiased estimators of $\mu$
having smaller variance than $X_1$ and $\bar X$, respectively. It turns out
that $E(X_1 \mid T) = E(\bar X \mid T) = \tilde\mu$, where
$\tilde\mu = (n+1)T/2n$. This is the same estimator of $\mu$ that was derived
on page 122.

The Rao-Blackwell Theorem would seem to provide a useful tool for improving
upon estimators. However, the tool is rarely used since, in the commonly
used statistical models in which the density or probability function
$f(x^{(n)}; \theta)$ is known except for the parameter values
$\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, the "standard" estimators
are either MLE's or functions of the MLE's, and it follows easily from the
Factorization Theorem that, if T is sufficient for $\theta$, then the MLE
$\delta$ of $g(\theta)$ is a function of T, say $\delta(T)$. Since
$E(\delta(T) \mid T) = \delta(T)$ by Theorem 7-6(b), MLE's are unaffected by
conditioning on a sufficient statistic.

Although it is not true in general, it often happens that if $T = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$ where $\hat\theta_i$ is the MLE of $\theta_i$, then $T$ is a "minimal" sufficient statistic for $\theta$, i.e., $T$ is a function of every other sufficient statistic. Moreover, it often happens that the unbiased estimator of a parameter $g(\theta)$ that depends on $T$ is unique in the sense that if $\delta_1(T)$ and $\delta_2(T)$ are two unbiased estimators of $g(\theta)$, then $\delta_1(T) = \delta_2(T)$ except perhaps on a subset of the sample space that has probability zero for all values of $\theta$. Under these circumstances, it then follows from the Rao-Blackwell Theorem that, if one can find a single unbiased estimator $\delta^*$ that is a function of $T$, then $\delta^*$ is the UMVU estimator of $g(\theta)$. Any other unbiased estimator $\delta$ can be improved upon by $E(\delta \mid T)$, but this is an unbiased estimator that depends on $T$, and by assumption $\delta^*$ is the unique unbiased estimator that is a function of $T$.

We  shall  now  define  a property  of  sufficient  statistics  T that 
assures  uniqueness  of  unbiased  estimators  that  are  functions  of  T. 

Definition. A statistic $T$ is said to be complete if the only real-valued functions $h(T)$ satisfying $E_\theta[h(T)] = 0$ for all values of $\theta$ are those for which $P_\theta\{h(T) = 0\} = 1$ for all $\theta$.

To see that unbiased estimators that depend on a complete, sufficient statistic are unique in the sense specified above, suppose $\delta_1(T)$ and $\delta_2(T)$ are two unbiased estimators of the same parameter $g(\theta)$. Then $E_\theta[\delta_1(T) - \delta_2(T)] = 0$ for all values of $\theta$. By the definition of completeness, it follows that $\delta_1(T) = \delta_2(T)$ except perhaps on a set of probability zero.

Theorem 9-4. (Lehmann-Scheffé Theorem.) Suppose $T$ is a complete, sufficient statistic for $\theta$, and $g(\theta)$ has at least one unbiased estimator. Then $g(\theta)$ has a unique UMVU estimator that depends on $T$.

Proof: Let $\delta$ be any unbiased estimator of $g(\theta)$. Then, by the Rao-Blackwell Theorem, $\delta^* = E(\delta \mid T)$ is again unbiased for $g(\theta)$, and $\mathrm{Var}(\delta^*) \le \mathrm{Var}(\delta)$ with equality holding if and only if $\delta$ is a function of $T$. Since $T$ is complete, $\delta^*$ is the unique unbiased estimator of $g(\theta)$ depending on $T$.

Many of the statistical models that we have considered have complete, sufficient statistics $T$ that can be determined by setting $T = (\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k)$ where $\hat\theta_i$ is the MLE of $\theta_i$. The sufficiency of the statistics $T$ in the examples below can be verified by the Factorization Theorem. The proofs of the completeness of many of these statistics are special cases of a theorem that can be found in E. L. Lehmann, Testing Statistical Hypotheses, John Wiley & Sons, New York, p. 132.

Bernoulli. If $X_1, X_2, \ldots, X_n$ are i.i.d., Bernoulli($\theta$), then $\bar{X}$ is complete and sufficient for $\theta$. Since $\bar{X}$ is unbiased for $\theta$, it is the UMVU estimator of $\theta$. Let $g(\theta) = \theta(1-\theta)/n$, which is the variance of $\bar{X}$. Since $\bar{X}(1 - \bar{X})/(n-1)$ is an unbiased estimator of $g(\theta)$ that depends on $\bar{X}$, it follows that $\bar{X}(1 - \bar{X})/(n-1)$ is the UMVU estimator of $g(\theta)$.
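The unbiasedness claimed in the Bernoulli example can be confirmed exactly, since $n\bar{X}$ has a binomial distribution. The sketch below is an illustration rather than part of the notes; the helper name `expected_value` and the values of `n` and `theta` are assumptions chosen for the check. It uses exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def expected_value(n, theta):
    """Exact E[Xbar(1 - Xbar)/(n - 1)] when n*Xbar ~ Binomial(n, theta)."""
    total = Fraction(0)
    for k in range(n + 1):
        prob = comb(n, k) * theta**k * (1 - theta) ** (n - k)
        xbar = Fraction(k, n)
        total += prob * xbar * (1 - xbar) / (n - 1)
    return total


theta = Fraction(3, 10)   # illustrative parameter value
n = 6                     # illustrative sample size
lhs = expected_value(n, theta)
rhs = theta * (1 - theta) / n   # Var(Xbar)
print(lhs, rhs)
```

The two printed fractions agree, which is the identity $E[\bar{X}(1-\bar{X})/(n-1)] = \theta(1-\theta)/n$ for this particular choice of $n$ and $\theta$.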

Poisson. If $X_1, X_2, \ldots, X_n$ are i.i.d., Poisson($\lambda$), then the MLE of $\lambda$ is $\bar{X}$, which is a complete and sufficient statistic. Since $\bar{X}$ is unbiased for $\lambda$, it is the UMVU estimator of $\lambda$.

Geometric. If $X_1, X_2, \ldots, X_n$ are i.i.d., each Geometric($p$), then the MLE of $p$ is $1/\bar{X}$. This is a biased estimator, but it is complete and sufficient for $p$. The UMVU estimator of $p$ is $(n-1)/(\sum X_i - 1)$ if $n > 1$. If $n = 1$, the UMVU estimator of $p$ has value 1 if $X_1 = 1$ and 0 if $X_1 > 1$. (In this case, the UMVU estimator is absurd.)
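As a numerical check of the geometric example, one can sum over the distribution of $T = \sum X_i$. The sketch below assumes, as the $n = 1$ case above suggests, that the geometric distribution is supported on $1, 2, \ldots$, so that $P(T = t) = \binom{t-1}{n-1} p^n (1-p)^{t-n}$ for $t \ge n$. The function names and the truncation point `t_max` are illustrative assumptions.

```python
from math import comb


def _pmf_sum(n, p, value, t_max):
    """Sum value(t) * P(T = t) for t = n, ..., t_max (truncated series)."""
    total = 0.0
    for t in range(n, t_max + 1):
        prob = comb(t - 1, n - 1) * p**n * (1 - p) ** (t - n)
        total += prob * value(t)
    return total


def mean_of_umvu(n, p, t_max=2000):
    """Approximate E[(n-1)/(T-1)]; requires n > 1."""
    return _pmf_sum(n, p, lambda t: (n - 1) / (t - 1), t_max)


def mean_of_mle(n, p, t_max=2000):
    """Approximate E[1/Xbar] = E[n/T]."""
    return _pmf_sum(n, p, lambda t: n / t, t_max)


p = 0.3   # illustrative parameter value
n = 3     # illustrative sample size
print(mean_of_umvu(n, p), mean_of_mle(n, p))
```

The first value should agree with $p$ up to truncation and rounding error, while the second should exceed $p$, reflecting the upward bias of the MLE $1/\bar{X}$.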

Normal. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d., each $N(\mu, \sigma^2)$.

(a) If both $\mu$ and $\sigma^2$ are unknown, the MLE of $\theta = (\mu, \sigma^2)$ is $T = (\bar{X}, S^2)$, which is complete and sufficient for $\theta$. Let $\hat\mu = \bar{X}$, $\hat\sigma^2 = \sum (X_i - \bar{X})^2/(n-1)$, and $\tilde\sigma = c_n \hat\sigma$, where $c_n$ is defined on page 121. Since these are unbiased estimators that are functions of the complete, sufficient statistic $T$, they are the UMVU estimators of $\mu$, $\sigma^2$, and $\sigma$.

(b) If $\sigma^2$ is known, $\bar{X}$ is complete and sufficient for $\mu$. Hence, $\bar{X}$ is the UMVU estimator of $\mu$.

(c) If $\mu$ is known, the MLE of $\sigma^2$ is $\hat\sigma^2 = \sum (X_i - \mu)^2/n$, which is complete and sufficient. Since $\hat\sigma^2$ is unbiased, it is the UMVU estimator of $\sigma^2$.

Two-sample Normal. Suppose $X_1, X_2, \ldots, X_m$, $Y_1, Y_2, \ldots, Y_n$ are independent, $X_i \sim N(\xi, \sigma^2)$, $Y_j \sim N(\eta, \tau^2)$.

(a) If all parameters are unknown, the MLE's $(\bar{X}, S_X^2, \bar{Y}, S_Y^2)$ are complete and sufficient for $\theta = (\xi, \sigma^2, \eta, \tau^2)$. Hence, the UMVU estimators of $\xi$, $\sigma^2$, $\eta$, $\tau^2$, and $\xi - \eta$ are $\bar{X}$, $\hat\sigma^2$, $\bar{Y}$, $\hat\tau^2$, and $\bar{X} - \bar{Y}$, where $\hat\sigma^2$ and $\hat\tau^2$ are the sample variances with divisors $m-1$ and $n-1$.

(b) If $\tau^2 = \sigma^2$ (by assumption) and $\xi$, $\eta$, and $\sigma^2$ are all unknown, then the MLE's $\bar{X}$, $\bar{Y}$, and $S^2 = [\sum (X_i - \bar{X})^2 + \sum (Y_j - \bar{Y})^2]/(m+n)$ are complete and sufficient. It follows that $\bar{X}$, $\bar{Y}$, and $(m+n)S^2/(n+m-2)$ are the UMVU estimators of $\xi$, $\eta$, and $\sigma^2$.

Bivariate Normal. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample from a bivariate normal distribution with parameters $\mu_X$, $\mu_Y$, $\sigma_X^2$, $\sigma_Y^2$, and $\rho$. The MLE's of these parameters are $(\bar{X}, \bar{Y}, S_X^2, S_Y^2, r)$ where $S_X^2 = \sum (X_i - \bar{X})^2/n$, $S_Y^2 = \sum (Y_i - \bar{Y})^2/n$, and $r$ is the sample correlation coefficient. Since these statistics are complete and sufficient, the unbiased estimators $\bar{X}$, $\bar{Y}$, $\hat\sigma_X^2$, and $\hat\sigma_Y^2$ (the sample variances with divisor $n-1$) are the UMVU estimators of $\mu_X$, $\mu_Y$, $\sigma_X^2$, and $\sigma_Y^2$. It can be shown that $r$ is a biased estimator of $\rho$ with mean approximately equal to $\rho[1 - (1-\rho^2)/2n]$. Although the UMVU estimator exists [see I. Olkin and J. W. Pratt, "Unbiased Estimation of Certain Correlation Coefficients," Annals of Mathematical Statistics, Vol. 29 (1958), p. 201], it is a complicated function of $r$. Olkin and Pratt recommend using the approximation $r[1 + (1-r^2)/2(n-4)]$, which inflates $r$ slightly to offset its bias toward zero.

Exercise. It can be shown that, if $X_1, X_2, \ldots, X_n$ are i.i.d., each Negative Exponential($\lambda$), the MLE of $\lambda$ is complete and sufficient. Determine the UMVU estimators of $1/\lambda$ and $1/\lambda^2$, the mean and variance of the $X_i$'s. Ans. $\bar{X}$, $n\bar{X}^2/(n+1)$.

Assume that $Y_1, Y_2, \ldots, Y_n$ are independent random variables such that $E(Y_i) = \eta_i$ and $\mathrm{Var}(Y_i) = \sigma_i^2 < \infty$. Let $\theta$ be some parameter such that $\theta = \sum c_i \eta_i$ for some choice of constants $c_i$. Then $\theta$ has at least one unbiased "linear" estimator, namely $\sum c_i Y_i$.

Definition. Given the situation above, we say that $\hat\theta$ is the best linear unbiased estimator (BLUE) of $\theta$ if $\hat\theta$ is a linear function of the $Y_i$'s (i.e., $\hat\theta = \sum a_i Y_i$) and has minimum variance among all unbiased linear estimators of $\theta$.

The expectation and variance of any linear estimator $\hat\theta = \sum a_i Y_i$ are given by $E(\hat\theta) = \sum a_i \eta_i$ and $\mathrm{Var}(\hat\theta) = \sum a_i^2 \sigma_i^2$. Note that these characteristics of $\hat\theta$ depend only on the means and variances of the $Y_i$'s, but no further assumptions about the distributions of the $Y_i$'s are needed. Thus, one can determine BLUE's without specifying the exact distributions of the observations $Y_i$.

For example, suppose $Y_1, Y_2, \ldots, Y_n$ have a common mean $\theta$ but possibly different variances $\sigma_i^2$, and one wants to find the BLUE of $\theta$. This situation applies if $Y_1, Y_2, \ldots, Y_n$ are a random sample from any distribution having finite variance, in which case the $Y_i$ have a common mean $\theta$ and a common variance $\sigma^2$. More generally, it applies in any situation where $Y_1, Y_2, \ldots, Y_n$ are independent unbiased estimators of the same parameter $\theta$. For example, $Y_i$ may be the average of $n_i$ i.i.d. random variables having mean $\theta$ and variance $\sigma^2$, in which case $E(Y_i) = \theta$ and $\mathrm{Var}(Y_i) = \sigma^2/n_i$.

Theorem 9-5. If $Y_1, Y_2, \ldots, Y_n$ are independent with $E(Y_i) = \theta$ and $\mathrm{Var}(Y_i) = \sigma_i^2 < \infty$, the BLUE of $\theta$ based on the $Y_i$ is $\hat\theta = \sum w_i Y_i$ where the weights $w_i$ satisfy $\sum w_i = 1$ and are inversely proportional to the variances $\sigma_i^2$ (i.e., $w_i = \sigma_i^{-2} / \sum_j \sigma_j^{-2}$). In particular, if the $Y_i$'s have the same variance, the BLUE of $\theta$ is $\hat\theta = \bar{Y}$.

Proof: Let $\hat\theta = \sum a_i Y_i$ be any unbiased linear estimator of $\theta$. Since $E(\hat\theta) = \sum a_i \theta$ and $\mathrm{Var}(\hat\theta) = \sum a_i^2 \sigma_i^2$, the unbiasedness condition implies that $\sum a_i = 1$, and the problem reduces to finding a vector of constants $a = (a_1, a_2, \ldots, a_n)$ to minimize $f(a) = \sum a_i^2 \sigma_i^2$ subject to the condition that $\sum a_i = 1$. In minimizing $f(a)$ on the set $A = \{a : \sum a_i = 1\}$, one can just as well consider minimizing

$$g(a) = \sum a_i^2 \sigma_i^2 + \lambda \left( \sum a_i - 1 \right)$$

where $\lambda$ is any constant, since the functions $f$ and $g$ are equal on $A$. The trick, called the "method of Lagrange multipliers," is to use differentiation to find the value $a^*$ that minimizes $g(a)$ over all values of $a$ in $R^n$. In general, the components of $a^*$ will depend on $\lambda$, but one can determine a value of $\lambda$ for which $a^*$ is in $A$. Since this value of $a^*$ minimizes $g$ over $R^n$ and since $a^*$ is in $A$, $a^*$ also minimizes $f$ over $A$. Here,

$$\partial g / \partial a_i = 2 a_i \sigma_i^2 + \lambda \quad \text{for } i = 1, 2, \ldots, n,$$

and it follows that $g(a)$ is minimized over $R^n$ by $a_i^* = -\lambda \gamma_i / 2$ where $\gamma_i = 1/\sigma_i^2$. In order to have $\sum a_i^* = 1$, we choose $\lambda = -2/\sum \gamma_i$, in which case $a_i^* = \gamma_i / \sum_j \gamma_j$.
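The inverse-variance weights of Theorem 9-5 are easy to compute directly. The following is a small sketch; the function names and the three variances are illustrative assumptions, not taken from the notes. It compares the variance of the BLUE with that of the unweighted mean.

```python
def blue_weights(variances):
    """Weights proportional to 1/sigma_i^2, normalized to sum to 1."""
    precisions = [1.0 / v for v in variances]   # gamma_i = 1 / sigma_i^2
    total = sum(precisions)
    return [g / total for g in precisions]


def linear_estimator_variance(weights, variances):
    """Var(sum a_i Y_i) = sum a_i^2 sigma_i^2 for independent Y_i."""
    return sum(a * a * v for a, v in zip(weights, variances))


variances = [1.0, 4.0, 0.25]   # illustrative sigma_i^2 values
w = blue_weights(variances)
var_blue = linear_estimator_variance(w, variances)

equal = [1.0 / len(variances)] * len(variances)
var_equal = linear_estimator_variance(equal, variances)
print(w, var_blue, var_equal)
```

With these weights the variance of the BLUE works out to $1/\sum \gamma_i$, which is never larger than the variance of $\bar{Y}$ and is strictly smaller whenever the $\sigma_i^2$ differ.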

Now suppose that $Y_1, Y_2, \ldots, Y_n$ are independent random variables with means $E(Y_i) = \alpha + \beta x_i$ and $\mathrm{Var}(Y_i) = \sigma^2$, where $x_1, x_2, \ldots, x_n$ are known constants such that not all of them are equal, and the "regression coefficients" $\alpha$ and $\beta$ are parameters to be estimated from the observations. In the absence of specific assumptions about the exact distributions of the $Y_i$'s, we search for the BLUE's of $\alpha$ and $\beta$.

Let us assume for the moment that the $Y_i$ have normal distributions, i.e., $Y_i \sim N(\alpha + \beta x_i, \sigma^2)$. Led by the hope that the MLE's of $\alpha$ and $\beta$ will turn out to be linear, consider the likelihood function in this case:

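$$L(\alpha, \beta, \sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left[ -(y_i - \alpha - \beta x_i)^2 / 2\sigma^2 \right] = (2\pi\sigma^2)^{-n/2} e^{-SS/2\sigma^2}$$

where

$$SS = \sum (y_i - \alpha - \beta x_i)^2.$$

Note that the values $a$ and $b$ of $\alpha$ and $\beta$ that maximize the likelihood function are the values of $\alpha$ and $\beta$ that minimize the sum of squares $SS$. Hence, the MLE's $a$ and $b$ are called the least-squares estimators in this case. The partial derivatives of $SS$ with respect to $\alpha$ and $\beta$ are:

$$\frac{\partial SS}{\partial \alpha} = -2 \sum (y_i - \alpha - \beta x_i), \qquad \frac{\partial SS}{\partial \beta} = -2 \sum (y_i - \alpha - \beta x_i) x_i.$$

Setting these partial derivatives equal to zero and solving for $\alpha$ and $\beta$ yields the MLE's

$$b = \sum (x_i - \bar{x}) Y_i / \sum (x_i - \bar{x})^2$$

$$a = \bar{Y} - b\bar{x}.$$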

Note that $a$ and $b$ are both linear estimators of $\alpha$ and $\beta$. Are they unbiased? To verify that $b$ is unbiased, we compute

$$E(b) = \frac{\sum (x_i - \bar{x})(\alpha + \beta x_i)}{\sum (x_i - \bar{x})^2} = \frac{\alpha \sum (x_i - \bar{x})}{\sum (x_i - \bar{x})^2} + \frac{\beta \sum (x_i - \bar{x}) x_i}{\sum (x_i - \bar{x})^2}.$$

The first term on the right is zero because $\sum (x_i - \bar{x}) = 0$; the second term reduces to $\beta$ because $\sum (x_i - \bar{x}) x_i = \sum x_i^2 - n\bar{x}^2$, which is another way of writing the denominator. Hence, $E(b) = \beta$. The estimator $a$ is also unbiased, because

$$E(a) = E(\bar{Y}) - E(b)\bar{x} = \sum (\alpha + \beta x_i)/n - \beta \bar{x} = \alpha.$$

Incidentally, the MLE of $\sigma^2$ in this case is $\hat\sigma^2 = SS_e/n$ where

$$SS_e = \sum (Y_i - a - b x_i)^2.$$

Although $\hat\sigma^2$ is a biased estimator of $\sigma^2$, the estimator $\tilde\sigma^2$ obtained by dividing the "residual sum of squares" $SS_e$ by $n-2$ can be shown to be unbiased. It also turns out that $SS_e$ is independent of $a$ and $b$, and $SS_e/\sigma^2$ has a chi-square distribution with $n-2$ degrees of freedom.

The clincher in this example is that $a$, $b$, and $\tilde\sigma^2$ can be shown to be complete and sufficient statistics for the parameters $\alpha$, $\beta$, and $\sigma^2$. It follows that $a$, $b$, and $\tilde\sigma^2$ are the UMVU estimators of the parameters.

What is the implication of this for the original problem of finding the BLUE's of $\alpha$ and $\beta$? Since the calculations of the expectations and variances of the linear estimators $a$ and $b$ do not use the normality assumptions, $a$ and $b$ are unbiased linear estimators of $\alpha$ and $\beta$ whether the $Y_i$'s have normal distributions or not. Moreover, they must be the BLUE's of $\alpha$ and $\beta$, because if there were another unbiased linear estimator, say $b_1$, which had smaller variance than $b$, then $b_1$ would be a better unbiased estimator than $b$ in the normal case, contradicting the fact that $b$ is UMVU in the normal case. This completes the proof of the following theorem:

Theorem 9-6. If $Y_1, Y_2, \ldots, Y_n$ are independent observations with $E(Y_i) = \alpha + \beta x_i$ and $\mathrm{Var}(Y_i) = \sigma^2$ where $x_1, x_2, \ldots, x_n$ are given constants, then the BLUE's of $\alpha$ and $\beta$ are the least squares estimators $a = \bar{Y} - b\bar{x}$ and $b = \sum (x_i - \bar{x}) Y_i / \sum (x_i - \bar{x})^2$. Moreover, if the observations $Y_i$ are normally distributed, then $a$ and $b$ are the UMVU estimators of $\alpha$ and $\beta$.
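The estimators in Theorem 9-6 can be computed in a few lines. The sketch below is illustrative only: the function name `least_squares_line` and the five data points are made-up assumptions. It also forms the residual sum of squares $SS_e$ and the unbiased variance estimate $SS_e/(n-2)$ discussed above.

```python
def least_squares_line(xs, ys):
    """Return (a, b) with b = sum((x_i - xbar) y_i) / sum((x_i - xbar)^2), a = ybar - b*xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    ss_x = sum((x - xbar) ** 2 for x in xs)   # requires the x_i not all equal
    b = sum((x - xbar) * y for x, y in zip(xs, ys)) / ss_x
    a = ybar - b * xbar
    return a, b


# Hypothetical data, chosen only to illustrate the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
residuals = [y - a - b * x for x, y in zip(xs, ys)]
ss_e = sum(e * e for e in residuals)       # residual sum of squares
sigma2_tilde = ss_e / (len(xs) - 2)        # unbiased estimate of sigma^2
print(a, b, ss_e, sigma2_tilde)
```

For these data the slope and intercept come out near 1.99 and 0.05, and the residuals sum to zero up to rounding error.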

It follows from the derivation above that, if $\gamma = c_1 \alpha + c_2 \beta$ is any linear function of the parameters $\alpha$ and $\beta$, then the BLUE of $\gamma$ is $\hat\gamma = c_1 a + c_2 b$. (In the normal case, $\hat\gamma$ is the UMVU estimator of $\gamma$.) In particular, the BLUE's of the expected values $E(Y_i) = \alpha + \beta x_i$ are the "fitted values" $a + b x_i$. Sometimes the primary purpose of estimating $\alpha$ and $\beta$ is to predict the expected value of a future value of $Y$ at $x = x_0$. The BLUE of $E(Y) = \alpha + \beta x_0$ is $a + b x_0$. Its variance is given in Exercise 2 below.

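Exercises. 1. Show that, if $Y_1, Y_2, \ldots, Y_n$ are independent random variables having possibly different means but the same variance $\sigma^2$, then $\mathrm{Var}(\sum c_i Y_i) = \sigma^2 \sum c_i^2$ and $\mathrm{Cov}(\sum c_i Y_i, \sum d_i Y_i) = \sigma^2 \sum c_i d_i$.

2. Use Exercise 1 to show that, if $a$ and $b$ are the BLUE's in the theorem above, then $\mathrm{Var}(b) = \sigma^2/SS(x)$, $\mathrm{Var}(a) = \sigma^2[(1/n) + \bar{x}^2/SS(x)]$, $\mathrm{Cov}(a, b) = -\bar{x}\sigma^2/SS(x)$, and $\mathrm{Var}(a + b x_0) = \sigma^2[(1/n) + (x_0 - \bar{x})^2/SS(x)]$ where $SS(x) = \sum (x_i - \bar{x})^2$.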

3. The "residual" $e_i$ corresponding to $Y_i$ is defined by $e_i = Y_i - a - b x_i$. Show that (a) $\sum e_i = 0$, (b) $E(e_i) = 0$, and (c) $\mathrm{Cov}(e_i, a) = \mathrm{Cov}(e_i, b) = 0$.

As a generalization of the "simple linear regression" model considered above, let $Y_1, Y_2, \ldots, Y_n$ be independent random variables with means

$$E(Y_i) = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$$

where $x_{1i}, x_{2i}, \ldots, x_{pi}$ are given constants and the $\beta_j$'s are unknown parameters. In addition, assume that the $Y_i$'s have the same variance $\sigma^2$, and the columns of the matrix $X$ below are linearly independent:¹

$$X = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{p1} \\ x_{12} & x_{22} & \cdots & x_{p2} \\ \vdots & \vdots & & \vdots \\ x_{1n} & x_{2n} & \cdots & x_{pn} \end{pmatrix}$$

To see that this model includes simple linear regression as a special case, set $x_{1i} = 1$ and $x_{2i} = x_i$ for $i = 1, 2, \ldots, n$ where $x_1, x_2, \ldots, x_n$ are the values of the "independent variable." In this case the condition that the columns of $X$ be linearly independent amounts to requiring that not all of the $x_i$'s have the same value.

Let $b_1, b_2, \ldots, b_p$ denote the least-squares estimators of the parameters $\beta_j$, i.e., the $b_j$'s are the values of the $\beta_j$'s that minimize

$$SS = \sum_{i=1}^{n} (Y_i - \beta_1 x_{1i} - \beta_2 x_{2i} - \cdots - \beta_p x_{pi})^2.$$

Theorem 9-7. (Gauss-Markov Theorem.) Under the above assumptions the least squares estimators $b_j$ are the BLUE's of the regression coefficients $\beta_j$. If $\gamma = \sum c_j \beta_j$ is any linear combination of the $\beta_j$'s, the BLUE of $\gamma$ is $\hat\gamma = \sum c_j b_j$.


¹ The columns of $X$ are said to be linearly dependent if there exist constants $a_1, a_2, \ldots, a_p$, not all of which are zero, such that $\sum_j a_j x_{ji} = 0$ for $i = 1, 2, \ldots, n$. In this case, one of the columns of $X$ is a linear combination of the others.

A proof of this theorem, which is of fundamental importance in many applications, can be given by mimicking the proof above for the case of a single "independent variable" $x$. If the $Y_i$'s have normal distributions, then the $b_j$'s are the UMVU estimators of the $\beta_j$'s and $\hat\gamma$ is the UMVU estimator of $\gamma$. In this case, the residual sum of squares $SS_e$, obtained by substituting the $b_j$'s for the $\beta_j$'s in $SS$ above, is independent of the $b_j$'s, and $SS_e/\sigma^2 \sim \chi^2(n-p)$. Whether the $Y_i$'s are normally distributed or not, the estimator $SS_e/(n-p)$ is unbiased for $\sigma^2$.
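For the general model, the least-squares estimators can be obtained by setting the partial derivatives of $SS$ to zero, which gives the standard "normal equations" $X'Xb = X'Y$ (a well-known fact that is not derived in these notes). The sketch below solves them with a small Gaussian elimination routine so that it needs no external libraries. The function names and data are illustrative assumptions, and each row of `x_rows` holds $(x_{1i}, \ldots, x_{pi})$ for one observation.

```python
def solve_linear_system(a, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        s = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - s) / m[r][r]
    return sol


def least_squares(x_rows, ys):
    """Return (b, SS_e, SS_e/(n-p)) for the model E(Y_i) = sum_j beta_j x_ji."""
    n, p = len(x_rows), len(x_rows[0])
    xtx = [[sum(row[j] * row[k] for row in x_rows) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y for row, y in zip(x_rows, ys)) for j in range(p)]
    b = solve_linear_system(xtx, xty)
    ss_e = sum(
        (y - sum(bj * xj for bj, xj in zip(b, row))) ** 2
        for row, y in zip(x_rows, ys)
    )
    return b, ss_e, ss_e / (n - p)


# Hypothetical data; simple linear regression as the special case x_1i = 1, x_2i = x_i.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
x_rows = [[1.0, x] for x in xs]
b, ss_e, sigma2 = least_squares(x_rows, ys)
print(b, ss_e, sigma2)
```

Solving the normal equations directly is adequate for a small, well-conditioned illustration like this one; numerical work on larger problems usually relies on more stable matrix factorizations.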


Exercises. 1. Suppose $Y_1, Y_2, \ldots, Y_n$ are i.i.d. random variables with mean $\beta$ and variance $\sigma^2$. Show that the least-squares estimator of $\beta$ is $\hat\beta = \bar{Y}$ and the residual sum of squares is $SS_e = \sum (Y_i - \bar{Y})^2$.

2. Consider the problem of comparing the means $\beta_1, \beta_2, \ldots, \beta_I$ of $I$ populations on the basis of independent random samples of sizes $n_1, n_2, \ldots, n_I$ from the respective populations. Let $Y_{ij}$ denote the $j$th observation from the $i$th population. Assuming that the observations $Y_{ij}$ have the same variance, show that the least-squares estimators of the means are $\hat\beta_i = \bar{Y}_i$, where $\bar{Y}_i$ is the sample mean of the observations in the $i$th group. Also, show that the BLUE of any linear combination of the means, $\sum c_i \beta_i$, is $\sum c_i \bar{Y}_i$, and find its variance. Ans. $\sigma^2 \sum c_i^2/n_i$.

3. Consider the same situation as in Exercise 2 except that

$$E(Y_{ij}) = \beta_i + \gamma z_{ij},$$

where the values $z_{ij}$ are known constants. Show that the least-squares estimators of the parameters are

$$\hat\gamma = \sum_{i,j} (z_{ij} - \bar{z}_i) Y_{ij} \Big/ \sum_{i,j} (z_{ij} - \bar{z}_i)^2$$

and

$$\hat\beta_i = \bar{Y}_i - \hat\gamma \bar{z}_i.$$

Also show that the variances of these estimators are given by

$$\mathrm{Var}(\hat\gamma) = \frac{\sigma^2}{\sum_{i,j} (z_{ij} - \bar{z}_i)^2}, \qquad \mathrm{Var}(\hat\beta_i) = \sigma^2 \left[ \frac{1}{n_i} + \frac{\bar{z}_i^2}{\sum_{i,j} (z_{ij} - \bar{z}_i)^2} \right].$$


